Genomic sequence of NGR234 symbiotic plasmid, its gene map, and its use in diagnostics and gene transfer in agriculture

ABSTRACT

The complete nucleotide sequence of the symbiotic plasmid pNGR234a, isolated from Rhizobium sp. NGR234, has been determined and analysed, and is presented here. The analysis includes the identification of a number of novel ORFs, and of the proteins expressible therefrom, to which putative functions have been ascribed.

TECHNICAL FIELD

[0001] This invention relates to a symbiotic plasmid of the broad host-range Rhizobium sp. NGR234 and its use. In particular, this invention relates to the isolation and analysis of the complete sequence of the NGR234 symbiotic plasmid pNGR234a, and the open reading frames (ORFs) identifiable therein as well as the proteins expressible from said ORFs.

BACKGROUND OF THE INVENTION

[0002] Together with carbon, hydrogen and oxygen, nitrogen is one of the essential components in organic chemistry. Although it is present in vast quantities in the atmosphere, nitrogen in its diatomic form N₂ remains unassimilable by living organisms. The nitrogen cycle begins by the fixation of nitrogen into ammonia which is chemically more reactive and can be assimilated into the food chain. A large fraction of the total nitrogen fixed every year is produced by microorganisms. Among these, the soil bacteria of the genera Azorhizobium, Bradyrhizobium, Sinorhizobium and Rhizobium, generally referred to as rhizobia, fix nitrogen in symbiotic associations with many plants from the Leguminosae family. This highly specific interaction leads to the formation of specialized root-, and in the case of Azorhizobium, stem-structures called nodules. It is within these nodules that rhizobia differentiate into bacteroids capable of fixing atmospheric nitrogen into ammonia. In turn, ammonia diffuses into the vegetal cells and sustains plant growth even under limiting nitrogen conditions.

[0003] The Rhizobium-legume interaction presents many interesting features. Obviously, the possibility of using this symbiosis as an “environmentally friendly” way to provide some of the most important world crops (such as soybean, bean and many other legumes) with fixed nitrogen without using nitrate-rich fertilizers, has important economic consequences. It is also an ideal model to study a non-pathogenic interaction between bacteria and a highly developed, multicellular organism such as the host plant. Furthermore, the various steps involved in the establishment of a functional nitrogen symbiosis, which include some dramatic morphological changes as well as processes of cellular differentiation, require a complex exchange of molecular signals. Despite many decades of studies, it is only recently that the Rhizobium-legume interaction has been partially understood at the molecular level. The establishment of a functional symbiosis can be divided into two major steps as follows.

[0004] (A) Rhizosphere Ecology and Nodulation

[0005] Rhizobia are soil bacteria that proliferate in the rhizosphere of compatible plants, taking advantage of the many compounds released by plant roots. In return, it has been shown that the presence of rhizobia in the rhizosphere reduces the susceptibility of plants to many root diseases. In the case of low nitrogen levels in the soil, compatible rhizobia can interact with host plants and start the nodulation process (Long, 1989; Fellay et al., 1995; van Rhijn and Vanderleyden, 1995). Molecular signalling between the two partners begins with the release by the plant of phenolic compounds (mostly flavonoids) that induce the expression of nodulation genes (referred to as nod, nol and noe genes). The NodD1 gene product appears to be the central mediator between the plant signal and nodulation gene induction (Bender et al., 1988). It is modified by the binding of flavonoids and acts as a positive regulator on the expression of the remaining nodulation genes. Among them, the nodABC loci encode products responsible for the synthesis of the core structure of lipooligosaccharides called Nod factors (Relic et al., 1994). More nodulation genes are involved in strain-specific modifications of the Nod factors as well as in their secretion. It seems established now that variability in the structure of Nod factors may play a significant role in the determination of the host-range of a given Rhizobium strain, that is, in its ability to efficiently nodulate different legumes. For example, the strain Rhizobium meliloti can only nodulate Medicago, Melilotus and Trigonella spp., whereas Rhizobium sp. NGR234 can symbiotically interact with more than 105 different genera of plants, including the non-legume Parasponia andersonii.

[0006] The structure of many Nod factors, their isolation from Rhizobium strains and their commercial application in agriculture have been described (NodNGR factors: Relić et al., 1994; WO 94/00466; NodRm factors: WO 91/15496). Secreted Nod factors act in turn as signal molecules that allow rhizobia to enter young root hairs of a host plant, and induce root-cortical cell division that will produce the future nodule. Invaginated rhizobia progress towards the forming nodule within infection threads that are synthesized by the plant cells. Bacteria are then released into the cytoplasm of dividing nodule cells where they differentiate into bacteroids capable of fixing atmospheric nitrogen.

[0007] With respect to regulation of the nodulation genes, other regulatory genes with similarities to nodD1 (genes that belong to the lysR family) have been identified in various strains (Davis and Johnston, 1990). The function of these genes, called nodD2, nodD3 or syrM, is only partially understood. Some nodD genes have been described (WO 94/00466; CA 1314249; WO 87/07910; U.S. Pat. No. 5,023,180). Also, recombinant DNA molecules including the consensus sequence of the promoters of nodD1-regulated genes, called nod-boxes (Fisher and Long, 1993), have been disclosed (U.S. Pat. No. 5,484,718; U.S. Pat. No. 5,085,588). Finally, recombinant plasmids with the nodABC genes or, in one case (Bradyrhizobium japonicum), a sequence influencing host specificity have been disclosed (U.S. Pat. No. 5,045,461; U.S. Pat. No. 4,966,847).

[0008] (B) Symbiotic Nitrogen Fixation

[0009] Inside the nodules, rhizobia differentiate into bacteroids that express the enzymatic complex (nitrogenase) required for the reduction of atmospheric nitrogen into ammonia. The nitrogenase is encoded by three genes, nifH, nifD and nifK, which are well conserved in nitrogen-fixing organisms (Badenoch-Jones et al., 1989). Many additional loci are necessary for functional nitrogenase activity. Those originally identified in Klebsiella pneumoniae are known as nif genes, whereas those found only in Rhizobium strains are described as fix genes (Fischer, 1994). Some of these gene products are required for the biosynthesis of cofactors or the assembly of the enzymatic complex, or play regulatory and other accessory roles (oxygen-limited respiration, etc.). Many of these genes are less conserved among the various rhizobial strains and in some cases their function is still not fully understood. The high sensitivity of the nitrogenase complex to free oxygen requires a very strict control of most nif and fix gene expression. In this respect, the FixL, FixJ, FixK, NifA and RpoN proteins have been identified in representative Rhizobium species as the major regulatory elements that, in microanaerobic conditions, activate the synthesis of the nitrogenase complex (Fischer, 1994). Recombinant DNA molecules containing nif genes/promoters have been disclosed: nifH promoters of B. japonicum (U.S. Pat. No. 5,008,194), nifH and nifD promoter of R. japonicum (EP 164245), nifA of B. japonicum and R. meliloti (EP 339830), nifHDK and hydrogen-uptake (hup) genes of R. japonicum (EP 205071).

[0010] Many more genetic determinants play a significant role in the Rhizobium-legume symbiosis. Genes (exo, lps and ndv genes) involved in the production of extracellular polysaccharides (EPS), lipopolysaccharides (LPS) and cyclic glucans of rhizobia play an essential role in the symbiotic interaction (Long et al., 1988; Stanfield et al., 1988). Mutation in these genes negatively influences the development of functional nodules. In this respect, some exopolysaccharides of the NGR234 derivative strain ANU280 have been disclosed (WO 87/06796). Although Nod factors seem to play a key role in the nodulation process, experimental data indicate that other signal molecules produced by the bacterial symbionts are required for functional symbiosis and may play a role in coordinating various steps such as the controlled invasion process, the release of rhizobia from the infection thread into the plant cell cytoplasm, the bacteroid differentiation process, etc. Moreover, the need for rhizobia to survive in the rhizosphere and to compete adequately with other microorganisms requires many more unidentified genes that, although they may not be characterised as proper symbiotic loci, do affect the efficiency of the various strains to induce functional nitrogen-fixing symbiosis in field conditions. Finally, in our view genetic engineering of improved rhizobial strains cannot be pursued without a more extended knowledge of the structure and complexity of the Rhizobium symbiotic genome.

[0011] In this respect we decided to determine the complete DNA sequence of a symbiotic plasmid of Rhizobium sp. NGR234. In contrast to Bradyrhizobium and Azorhizobium that carry symbiotic genes on large chromosomes (ca. 8 Mbp) and to R. meliloti that harbours two very large symbiotic plasmids of 1.4 and 1.6 Mbp, NGR234 carries a single plasmid of ca. 500 kbp, pNGR234a. Moreover, it has been shown by transfer of pNGR234a into heterologous rhizobia, and even into non-nodulating Agrobacterium tumefaciens, that most nodulation functions are encoded by this plasmid (Broughton et al., 1984). The fact that NGR234 is able to interact symbiotically with more plants than any other known strain, and that a complete ordered cosmid library of pNGR234a was available, reinforced NGR234 as the best choice for a large-scale sequencing effort on a symbiotic plasmid (Perret et al., 1991; Freiberg et al., 1997).

[0012] Automated fluorescent methods have been used to sequence cosmids from eukaryotic organisms, including Saccharomyces cerevisiae (Levy, 1994), Caenorhabditis elegans (Sulston et al., 1992), Drosophila melanogaster (Hartl and Palazzolo, 1993), and Homo sapiens (Bodmer, 1994), as well as chromosomes from the prokaryotes Haemophilus influenzae (Fleischmann et al., 1995) and Mycoplasma genitalium (Fraser et al., 1995). In most large-scale sequencing centres this technology is based mainly on the shotgun approach. After random fragmentation of DNA (e.g. cosmids, bacterial artificial chromosomes (BACs), entire chromosomes) using sonication or mechanical forces, size-selected fragments are subcloned into M13 phages, phagemids or plasmids and sequenced by cycle sequencing using dye primers (Craxton, 1993). A disadvantage of this method is that DNA regions with elevated GC contents produce large numbers of compressions (unresolvable foci in sequence gels) in the dye primer sequences leading to several hundred compressions per assembled cosmid sequence. It is known that the use of dye terminators—fluorescently labelled dideoxynucleoside triphosphates—instead of dye primers reduces the number of compressions (Rosenthal and Charnock-Jones, 1993). Therefore, dye terminators are frequently being used for gap closure and proofreading after assembly of the shotgun data.

[0013] To sequence GC-rich cosmids with the highest accuracy, the effectiveness of shotgun sequencing with dye terminators in comparison to dye primer sequencing was investigated. To improve the incorporation of dye terminators into DNA, a modified Taq DNA polymerase carrying a single mutation was used (Tabor and Richardson, 1995). This enzyme has properties similar to a thermostable “sequenase” and is commercially available as Thermo Sequenase (Amersham, Buckinghamshire, UK) or AmpliTaq FS (Perkin-Elmer, Foster City, Calif., USA). With this enzyme, the concentrations of dye terminators needed in the cycle sequencing reactions can be reduced 20- to 250-fold. It was found that dye terminator shotgun sequencing leads to compression-free raw data that can be assembled much faster than shotgun data mainly obtained by dye primer sequencing. This strategy thus allows a several-fold increase in the speed of sequencing individual cosmids. This was demonstrated by comparing assembly of the sequence data of two cosmids from pNGR234a generated by different chemistries: cosmid pXB296 was sequenced with dye terminators, whereas data for pXB110 were obtained using the common dye primer method. Also disclosed is the analysis of the entire pXB296 sequence.

[0014] Moreover, the dye terminator shotgun sequencing strategy used to generate the sequence data for pXB296 was also used to sequence all the other remaining overlapping cosmids of the plasmid pNGR234a. In summary, 20 cosmids have been sequenced together with two PCR products and a subcloned DNA fragment derived from a cosmid identified as pXB564 in order to generate the plasmid's complete nucleotide sequence.

[0015] After its assembly, the analysis of the entire nucleotide sequence of pNGR234a, especially the determination of putative coding regions and the prediction of their expressible proteins and putative functions, was performed. Initially, analysis of the region covered by cosmid pXB296 was extended to cosmids pXB368 and pXB110. Thus, in approximately 100 kb of the plasmid (position 417,796-517,279) most ORFs and their deduced proteins with different putative functions were predicted. Subsequently, the rest of pNGR234a was analyzed.

SUMMARY OF THE INVENTION

[0016] The present invention provides the complete nucleotide sequence of symbiotic plasmid pNGR234a or degenerate variants thereof of Rhizobium sp. NGR234.

[0017] The present invention also contemplates sequence variants of the plasmid pNGR234a altered by mutation, deletion or insertion.

[0018] Also encompassed by the present invention are each of the ORFs derivable from the nucleotide sequence of pNGR234a or variants thereof.

[0019] In a preferred embodiment, the ORFs derived from the nucleotide sequence of pNGR234a encode the functions of nitrogen fixation, nodulation, transportation, permeation, synthesis and modification of surface poly- or oligosaccharides, lipo-oligosaccharides or secreted oligosaccharide derivatives, secretion (of proteins or other biomolecules), transcriptional regulation or DNA-binding, peptidolysis or proteolysis, transposition or integration, plasmid stability, plasmid replication or conjugal plasmid transfer, stress response (such as heat shock, cold shock or osmotic shock), chemotaxis, electron transfer, synthesis of isoprenoid compounds, synthesis of cell wall components, rhizopine metabolism, synthesis and utilization of amino acids, rhizopines, amino acid derivatives or other biomolecules, degradation of xenobiotic compounds, or encode proteins exhibiting similarities to proteins of amino acid metabolism or related ORFs, or enzymes (such as oxidoreductase, transferase, hydrolase, lyase, isomerase or ligase).

[0020] In another preferred embodiment, the ORFs are under the control of their natural regulatory elements or under the control of analogues to such natural regulatory elements.

[0021] The present invention also provides the sequences of the intergenic regions of pNGR234a which, in a preferred embodiment, are regulatory DNA sequences or repeated elements. In a further preferred embodiment, the intergenic sequences are ORF-fragments.

[0022] Also provided by the present invention are mobile elements (insertion elements or mosaic elements) derivable from the nucleotide sequences of the present invention.

[0023] The present invention also contemplates the use of the disclosed nucleotide sequences or ORFs in the analysis of genome structure, organisation or dynamics.

[0024] Also provided by the present invention is the use of the nucleotide sequences or ORFs in the subcloning of new nucleotide sequences. In a preferred embodiment, the new nucleotide sequences are coding sequences or non-coding sequences.

[0025] In yet a further preferred embodiment, the nucleotide sequences or ORFs are used in genome analysis and subcloning methods as oligonucleotide primers or hybridization probes.

[0026] The present invention further provides proteins expressible from the disclosed nucleotide sequences or ORFs.

[0027] Also contemplated by the present invention is the use of the disclosed nucleotide sequences, individual ORFs or groups of ORFs or the proteins expressible therefrom in the identification and classification of organisms and their genetic information, the identification and characterisation of nucleotide sequences, the identification and characterisation of amino acid sequences or proteins, the transportation of compounds to and from an organism which is host to said nucleotide sequences, ORFs or proteins, the degradation and/or metabolism of organic, inorganic, natural or xenobiotic substances in a host organism, or the modification of the host-range, nitrogen fixation abilities, fitness or competitiveness of organisms.

[0028] The present invention also provides plasmid pNGR234a of Rhizobium sp. NGR234 comprising the disclosed nucleotide sequence or any degenerate variant thereof.

[0029] The present invention also provides a plasmid harbouring at least one of the disclosed ORFs or any degenerate variant thereof.

[0030] The plasmids of the invention may be produced recombinantly and/or by mutation, deletion, insertion or inactivation of an ORF, ORFs or groups of ORFs.

[0031] The present invention also provides the use of the disclosed plasmids or variants thereof in obtaining a synthetic minimal set of ORFs required for functional Rhizobium-legume symbiosis, the modification of the host-range of rhizobia, the augmentation of the fitness or competitiveness of Rhizobium sp. NGR234 in the soil and its nodulation efficiency on host plants, the introduction of desired phenotypes into host plants using the disclosed plasmids as stable shuttle systems for foreign DNA encoding said desired phenotypes, or the direct transfer of the disclosed plasmids into rhizobia or other microorganisms without using other vectors for mobilization.

[0032] The nucleotide sequences of the present invention were advantageously obtained using known cycle sequencing methods. The preferred dye terminator/thermostable sequenase shotgun sequencing method used to generate the nucleotide sequences of the present invention, when applied to cosmids and when compared to other sequencing methods, was shown to yield sequence reads of the highest fidelity. Consequently, the speed of assembly of particular cosmids was increased, and the resultant high-quality sequences required little editing or proofreading. Thus, the preferred sequencing method described herein was successfully used to generate the complete nucleotide sequence of all the overlapping cosmids of plasmid pNGR234a, thereby resulting in the assembly of the complete sequence of the plasmid.

[0033] The complete sequence of pNGR234a is disclosed for the first time in this application, as are the majority of the ORFs predicted within the sequence. Putative functions have been ascribed to the novel and inventive ORFs disclosed herein and the proteins for which they code.

BRIEF DESCRIPTION OF DRAWINGS

[0034] The present invention is described below and illustrated thereafter in the appended examples, with reference to the following figures:

[0035] FIG. 1 A graph comparing sequences from pXB296 obtained by different cycle sequencing methods.

[0036] FIG. 2 A schematic diagram showing the organization of the predicted ORFs in pXB296 from Rhizobium sp. NGR234.

[0037] FIG. 3 The complete nucleotide sequence of plasmid pNGR234a (with the pages labelled sequentially from 19961 to 1996142).

[0038] FIG. 4 A schematic diagram showing the map of the 20 sequenced cosmids covering the 536 kb symbiotic plasmid pNGR234a of Rhizobium sp. NGR234.

[0039] FIG. 5 A diagram indicating multiple alignments of the nucleotide sequence of the replication origins of various plasmids.

[0040] FIG. 6 A diagram indicating multiple DNA sequence alignments of the regions containing the origin of transfer of various plasmids.

[0041] FIG. 7 A schematic diagram showing a circular representation of the symbiotic plasmid pNGR234a of NGR234.

DETAILED DESCRIPTION OF THE INVENTION AND BEST MODE

[0042] Comparison of Different Shotgun Sequencing Strategies

[0043] The following is a more detailed description of certain key aspects of the present invention.

[0044] GC-rich cosmids were examined to investigate whether they could be sequenced much more efficiently using dye terminators throughout the shotgun phase instead of dye primers. As a test case, cosmid pXB296 with a GC content of 58 mol % from pNGR234a, the symbiotic plasmid of Rhizobium sp. NGR234, was exclusively sequenced using dye terminators in combination with a thermostable sequenase [Thermo Sequenase (Amersham)]. Another rhizobial cosmid with identical GC content, pXB110, was sequenced using traditional dye primer chemistry and Taq DNA polymerase.

[0045] Using the dye terminator/thermostable sequenase shotgun strategy, it was shown that most, if not all, compressions could be resolved and reads were produced with the highest fidelity among all sequencing chemistries tested. As a result, a much faster assembly of cosmid pXB296 in comparison to pXB110 was obtained. The shotgun data could be assembled into a high-quality sequence without extensive editing and proofreading. By measuring the error rate in overlapping regions between individual cosmids from pNGR234a, as well as the cosmid vector sequence itself (data not shown), it was estimated that the accuracy of the pXB296 sequence is higher than 99.98%. Using other thermostable sequenases such as AmpliTaq FS (Perkin-Elmer), similar results were expected because thermostable sequenases have similar properties.
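The accuracy estimate described above, derived from the error rate in regions sequenced twice, can be illustrated with a minimal sketch. The function names and the example reads are hypothetical and are not taken from the disclosure; the sketch assumes that the two reads are already aligned and of equal length.

```python
def count_mismatches(read_a: str, read_b: str) -> int:
    """Count positions at which two aligned, equal-length reads disagree."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must be aligned to the same length")
    return sum(1 for a, b in zip(read_a.upper(), read_b.upper()) if a != b)


def accuracy_percent(mismatches: int, bases_compared: int) -> float:
    """Accuracy in percent: the share of compared bases that agree."""
    if bases_compared <= 0:
        raise ValueError("bases_compared must be positive")
    return 100.0 * (bases_compared - mismatches) / bases_compared


# Hypothetical overlap between two cosmids (illustration only).
overlap_a = "ACGTACGTAC"
overlap_b = "ACGTACGTAT"
n_mismatch = count_mismatches(overlap_a, overlap_b)
print(n_mismatch, accuracy_percent(n_mismatch, len(overlap_a)))

# An accuracy above 99.98% corresponds to fewer than 2 errors per 10,000 bases.
print(accuracy_percent(2, 10_000))
```

In practice the comparison would be made over all overlapping regions and the vector sequence, as stated in the paragraph above.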

[0046] Dye primer chemistry in combination with Thermo Sequenase was also examined. Although the peak uniformity of signals was much improved over dye primer/Taq DNA polymerase data, the number of compressions in GC-rich shotgun reads was not reduced significantly. Compressions in shotgun raw data enormously increase the overall effort of editing, proofreading, and finishing a cosmid as shown for pXB110 (Table 1).

[0047] Because of their longer reading potential, dye primer reads are helpful for gap closure. However, using ABI 373A sequencers (Applied Biosystems, Inc. (ABI), Perkin-Elmer, Foster City, Calif., USA), dye primer reads are, on average, only ~50 bases longer than dye terminator reads.

[0048] Using the experimental conditions of the present invention, shotgun sequencing with dye terminators and a thermostable sequenase is superior because for GC-rich cosmid templates it removes most of the compressions and this leads to a several-fold improvement in assembling and finishing of cosmid-sized projects. Although dye terminators are slightly more expensive than dye primers, the overall saving in time for finishing projects has, in our experience, a much greater effect on general costs.

[0049] It has been shown that the strategy of the present invention is effective for high-throughput shotgun sequencing of GC-rich templates. This strategy was therefore used to sequence the remaining 19 overlapping cosmids of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234.

TABLE 1
Comparison of the assembly of the sequence data from cosmids pXB296 (dye terminator shotgun reads) and pXB110 (dye primer shotgun reads)

Data assembly                                                 pXB296    pXB110
Average length of the shotgun reads (bases)                   332       378
No. of shotgun reads used for assembly                        786       899
No. of shotgun reads assembled with 4% mismatch^(a)           736       308
No. of shotgun reads assembled with 25% mismatch^(a)          775       879
No. of contigs^(b) longer than 1 kbp                          3         25
No. of contigs left after editing^(c)                         2         4
No. of additional reads (gap closure and proofreading)^(d)    32        191
Total length of cosmid insert (bp)                            34,010    34,573
Sequencing redundancy (per bp)                                8.0       10.5

… permitted, minimum initial match = 15, maximum no. of pads per reading during the alignment procedure = 8, maximum no. of pads per reading in contig to align any new reading = 8, alignment mismatches 4% and 25%, respectively.
… on selected templates and, in case of pXB110, for solving ambiguities (compressions) by the resequencing of clones with universal primer and dye terminators.

[0050] In total, 20 cosmids, two PCR products (1.5 and 2.0 kb in length) and a 1.5 kb restriction fragment were sequenced in order to generate the complete pNGR234a sequence (FIG. 4).
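The difference between the two chemistries reported in Table 1 can be expressed as the fraction of shotgun reads that assemble at each mismatch setting. The following is a minimal sketch using only the read counts given in Table 1; the dictionary layout and the function name are hypothetical.

```python
# Read counts taken from Table 1; the data structure is illustrative only.
TABLE_1 = {
    "pXB296": {"reads": 786, "assembled_4pct": 736, "assembled_25pct": 775},
    "pXB110": {"reads": 899, "assembled_4pct": 308, "assembled_25pct": 879},
}


def assembled_fraction(assembled: int, total: int) -> float:
    """Fraction of shotgun reads that entered the assembly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return assembled / total


for cosmid, row in TABLE_1.items():
    strict = assembled_fraction(row["assembled_4pct"], row["reads"])
    relaxed = assembled_fraction(row["assembled_25pct"], row["reads"])
    print(f"{cosmid}: {strict:.1%} at 4% mismatch, {relaxed:.1%} at 25% mismatch")
```

At the stringent 4% setting, roughly 94% of the dye terminator reads (pXB296) assemble, against roughly 34% of the dye primer reads (pXB110). This is consistent with the statement that compressions in dye primer data hinder assembly.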

[0051] Genetic organization of pXB296

[0052] All 28 predicted open reading frames (ORFs) in pXB296 (FIG. 2) show significant homologies to database entries (Table 2). The first putative gene cluster (cluster I) containing ORF1 to ORF5 corresponds to various oligopeptide permease operons (Hiles et al., 1987; Perego et al., 1990). Only ORF5 shows homology to a gene from a different bacterium, Bacillus anthracis (Makino et al., 1989). Each homologue encodes membrane-bound or membrane-associated proteins, suggesting that all five ORFs are involved in oligopeptide permeation.

[0053] Organization of the predicted gene cluster IV, including the nifA homologue ORF16 (fixABCX, nifA, nifB, fdxN, ORF, fixU homologues, position 16,746-24,731), the predicted locations of the σ⁵⁴-dependent promoters and the nifA upstream activator sequences (FIG. 2), correspond to the organization found in Rhizobium meliloti and Rhizobium leguminosarum bv. trifolii (Iismaa et al., 1989; Fischer, 1994). NifA is a positive transcriptional activator (Buikema et al., 1985), whereas nif and fix genes are essential for symbiotic nitrogen fixation. Identification of σ⁵⁴-dependent promoter sequences, together with the upstream activator motifs upstream of ORF21, ORF22, and ORF23, suggests that these ORFs may play an important, but still undefined, role in symbiosis.
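A search for candidate σ⁵⁴-dependent promoters of the kind referred to above can be sketched as a simple pattern scan. The sketch assumes the minimal motif GG-N10-GC, which is drawn from general knowledge of σ⁵⁴ promoters and is not stated in this description; the function name is hypothetical. The test sequence is the 14-mer quoted elsewhere in this description for position 27,098-27,111 of pXB296.

```python
import re

# Assumed minimal sigma-54 motif: GG, any 10 bases, GC (not stated in the text).
# The lookahead lets overlapping candidates be reported as well.
SIGMA54_MINIMAL = re.compile(r"(?=(GG[ACGT]{10}GC))")


def find_sigma54_minimal(seq: str) -> list[tuple[int, str]]:
    """Return (0-based offset, matched 14-mer) for every candidate site."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in SIGMA54_MINIMAL.finditer(seq)]


# 14-mer quoted in this description (position 27,098-27,111 of pXB296).
candidate = "GGGGGCACAATTGC"
hits = find_sigma54_minimal(candidate)
print(hits)
```

A scan of this kind only proposes candidate sites; a motif this short will match by chance in a GC-rich sequence, so hits would need to be weighed against their position relative to an ORF and against expression data.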

[0054] Inevitably, large-scale sequencing uncovers differences with already published sequences. van Slooten et al. (1992) cloned a 5.8 kb EcoRI fragment from Rhizobium sp. NGR234 and sequenced 2067 bp by manual radioactive methods (EMBL accession no. S38912). This sequence exhibits 2.4% mismatches with the corresponding sequence in pXB296.

TABLE 2
Putative ORFs of pXB296 and homologies of the deduced amino acid sequences to known proteins

Columns: ORF^(a); st.^(b); position on cosmid (base no.)^(c); ribosomal binding site, given as SD-sequence-distance from start codon (bases)-start codon^(d) (SD-sequence: 5′-TAAGGAGGTGA-3′); no. of deduced amino acids (aa); homologous amino acids (position); homologous protein: name, length (aa)^(c), function^(f), accession no., identity (%)^(g), similarity (%)^(g).

ORF        st.  position     SD-distance-start  aa    hom. aa  protein     len.  function                              acc. no.  id.  sim.
ORF1^(h)   +    00001-00625                     >207  1-207    OppB        306   oligopeptide permease proteins        X05491    45   68
ORF2       +    00628-01503  GTATCCGGT-7-ATG    291   2-289    OppC        305   oligopeptide permease proteins        X56347    37   63
ORF3       +    01505-02512  AGCGGAGG-7-ATG     335   8-327    OppD        336   oligopeptide permease proteins        X56347    49   69
ORF4       +    02509-03570  TGAAGTGGT-6-ATG    353   2-323    OppF        334   oligopeptide permease proteins        X05491    51   69
ORF5       +    03606-04991  CAAGGA-6-ATG       461   1-458    CapA        411   encapsulation protein                 M24150    25   48
ORF6       +    05460-06863  CCGAGAGG-8-ATG     467   1-464    BioA        455   aminotransferase                      M29292    29   55
ORF7       +    06888-08426  GCCTTCGG-5-GTG     512   97-509   ORF^(i)     417   unknown                               D37877    36   58
                                                      34-510   GapD        482   succinic semialdehyde dehydrogenase   M38417    33   57
ORF8       −    09781-10860  GAACGTGG-8-ATG     359   72-299   ORF^(i)     414   transposase homologue minicircle DNA  X15942    30   48
ORF9       +    11124-12455  ?-7-ATG            443   2-443    GLUDI       558   glutamate dehydrogenase               M37154    41   60
ORF10      −    13370-14116  AAAGGA-6-ATG       248   1-245    ORF2^(i)    231   transposase homologues, IS1162        X79443    45   64
ORF11      −    14128-15672  CATGGAG-7-TTG      514   1-513    ORF1^(i)    558   transposase homologues, IS1162        X79443    41   62
ORF12      −    16712-16942  GAAGGA-8-ATG       76    1-70     FixU        70    unknown                               P42710    63   80
ORF13      −    16939-17265  ACAAGAGG-7-ATG     109   1-79     ORF2^(i)    >78   unknown                               X07567    53   81
                                                      15-107   NifZ        159   involved in FeMo-cofactor synthesis   M20568    39   56
ORF14      −    17349-17543  CCAGGAG-9-ATG      64    1-64     FdxN        64    ferredoxin-like                       M21841    80   87
ORF15      −    17585-19066  AGTGGAG-7-ATG      493   1-493    NifB        490   involved in FeMo-cofactor synthesis   M15544    73   84
ORF16      −    19292-20962  ATTGG-12-ATG       556   9-556    NifA        541   transcriptional regulator             X02615    59   72
ORF17      −    21129-21422  AGGGGAG-7-ATG      97    1-97     FixX        98    required for nitrogen fixation        M15546    84   87
ORF18      −    21437-22744  AACTGAGGT-7-ATG    435   1-435    FixC        435   required for nitrogen fixation        M15546    83   90
ORF19      −    22755-23864  ATAGGAG-6-ATG      369   18-369   FixB        353   required for nitrogen fixation        M15546    79   89
ORF20      −    23874-24731  TAAAGAG-5-ATG      285   1-285    FixA        292   required for nitrogen fixation        M15546    74   85
ORF21      −    25148-25468  CCAGGAG-10-ATG     106   1-106    ORF118^(i)  108   unknown                               X13691    55   71
ORF22      −    26145-26711  GAAGGAG-9-ATG      188   9-199    −           241   hypothetical protein                  U32739    47   64
                                                      1-173    −           166   peroxisomal protein                   U11244    32   57
ORF23      +    27169-27861  GAAGGA-7-ATG       230   1-167    NifQ        167   probably involved in Mo-processing    X13303    37   57
ORF24      +    27920-29434  CTGGGAGG-18-ATG    504   1-454    DctA1       456   C₄-dicarboxylate transporter          S38912    97   98
                                                      8-454    DctA2       449   C₄-dicarboxylate transporter          S38912    97   98
ORF25      +    29431-30675  TTCGGCGG-12-ATG    414   2-414    CamC        415   cytP450-like                          M12546    34   53
ORF26      +    30676-31332  TTGGG-5-TTG        218   30-190   LinA        155   γ-hexachlorocyclohexan-dechlorinase   D90355    27   51
ORF27      +    31329-33035  AGTGGAG-10-ATG     568   28-270   FabG        244   reductase                             M84991    38   57
                                                      294-534                                                                    30   57
ORF28^(k)  +    33173-34010  CAAGGAG-5-ATG      >279  1-279    LuxA        355   luciferase α-subunit                  M10961    23   49

…tobacter vinelandii (NifZ), Rhizobium meliloti (FdxN, NifBA, FixXCBA), Bradyrhizobium japonicum (ORF118), Haemophilus influenzae (hypothetical protein), Lipomyces kononenkoae (peroxisomal protein), Klebsiella pneumoniae (NifQ), Rhizobium sp. NGR234 (DctA), Pseudomonas putida (CamC), Pseudomonas paucimobilis (LinA), Escherichia coli (FabG), Vibrio harveyi (LuxA).

[0055] The published sequence contains the gene dctA (encoding a C₄-dicarboxylate permease), which is 144 bases shorter than in pXB296. In this respect, a single nucleotide deletion in position 29,248 of the cosmid sequence close to the 3′ end of the gene causes a frameshift leading to a DctA product extended by 48 residues. van Slooten et al. (1992) also failed to identify the nifQ homologue, ORF23 (position 27,169-27,861), presumably because they overlooked a small XhoI fragment located between positions 27,349 and 27,536 on pXB296. Expression studies allowed these investigators to define a putative σ⁵⁴-dependent promoter in a 1.7 kb SmaI fragment (position 27,094-28,818 in pXB296). This fragment stretches from the upstream region of ORF23 to the 5′ part of dctA. The 58 bp intergenic region between ORF23 and dctA contains a stem-loop structure but no obvious promoter sequence. Possibly the promoter that controls dctA is located upstream of ORF23 (e.g. the minimal consensus sequence included in GGGGGCACAATTGC at position 27,098-27,111). Although clones containing dctA complemented mutants of R. meliloti and R. leguminosarum for growth on dicarboxylates, the growth of the NGR234 dctA deletion mutant was not affected (van Slooten et al., 1992). Nevertheless, this mutant was unable to fix nitrogen in nodules. Because dctA is now possibly part of a larger transcription unit, the symbiotic phenotype may also result from the inactivation of downstream genes.
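The relation between ORF coordinates and protein length underlying the figures in this paragraph (144 bases corresponding to 48 residues) can be checked with a short sketch. It assumes that the positions given in Table 2 are inclusive and that each complete ORF ends with a stop codon; the function name is hypothetical.

```python
def orf_protein_length(start: int, end: int) -> int:
    """Number of deduced amino acids for an ORF given inclusive coordinates.

    Assumes the coordinate range covers the start codon through the stop
    codon, so the stop codon is subtracted from the codon count.
    """
    n_bases = end - start + 1
    if n_bases < 6 or n_bases % 3 != 0:
        raise ValueError("coordinates do not describe a whole number of codons")
    return n_bases // 3 - 1


# ORF23 (nifQ homologue) and ORF24 (dctA), positions as listed in Table 2.
print(orf_protein_length(27169, 27861))  # deduced length of ORF23
print(orf_protein_length(27920, 29434))  # deduced length of ORF24

# A difference of 144 bases in gene length corresponds to 48 codons.
print(144 // 3)
```

Under these assumptions the computed lengths agree with the 230 and 504 deduced amino acids listed for ORF23 and ORF24 in Table 2.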

[0056] Interestingly, the GC content of the predicted pXB296 ORFs ranges from 53.3 mol % to 64.6 mol %, with an overall cosmid GC content of 58.5 mol %. Genomes of Azorhizobium, Bradyrhizobium, and Rhizobium species have GC contents of 59 mol % to 65 mol % (Padmanabhan et al., 1990), with 62 mol % reported for Rhizobium sp. NGR234 (Broughton et al., 1972). Although pXB296 covers <7% of the complete symbiotic plasmid sequence, its lower overall GC value suggests that symbiotic genes might have evolved by lateral transfer from other organisms. In this case, methods of the type applied in the present invention will become even more relevant in sequencing the whole genome.
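The GC contents quoted above (in mol %) are the percentage of G and C among the bases of a sequence. The following minimal sketch shows the calculation; the function name is hypothetical, and ambiguous bases such as N are ignored as an assumption of this sketch.

```python
def gc_content(seq: str) -> float:
    """GC content in percent, counting only unambiguous A, C, G and T bases."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Short illustrative sequences (not taken from pNGR234a).
print(gc_content("GCAT"))
print(gc_content("GGCCGGAT"))
```

Applied per ORF and to the whole cosmid, a calculation of this kind yields the range and the overall value compared in the paragraph above.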

[0057] Genetic Organization of the 100 kb Region Covered by Cosmids pXB296, pXB368 and pXB110

[0058] Extending the analysis of pXB296 to a 100 kb region stretching from position 417,796 to 517,279 on the symbiotic plasmid pNGR234a led initially to the assignment of only 76 ORFs, listed in Table 3 (excluding the first incomplete ORF noted in the analysis of pXB296 (“ORF1” of Table 2)). The ORFs y4tQ to y4vJ (excluding ORFs y4uD and y4uG and excluding ORF-fragments fu1, fu2, fu3, fu4 and fv1; see Table 3) are identical to ORFs 2 to 28 of the analysis of pXB296 in Table 2 apart from minor revisions (N.B. the analysis recited in Table 3 should be taken as the definitive analysis; Table 2 merely represents preliminary findings). Cosmid pXB110 was sequenced with the dye primer shotgun sequencing strategy in order to compare it with the dye terminator shotgun sequencing strategy used to sequence cosmid pXB296. Together, cosmids pXB296, pXB368 and pXB110 cover nearly this entire region. A PCR product and a restriction fragment of cosmid pXB564 also had to be sequenced in order to fill the gap from position 480,607 to 483,991 between cosmids pXB368 and pXB110 (FIG. 4). Among the 76 predicted ORFs, 7 ORFs and their deduced proteins show no homologies to database entries. The other predicted ORFs and their deduced proteins do exhibit such homologies and therefore have putative roles in nitrogen fixation (ORFs y4uJ to y4vB, y4vE, y4vN to y4vR, y4wK and y4wL), nodulation (ORFs y4yC and y4yH), transport (ORFs y4tQ to y4uA, y4vF and y4wM), secretion of proteins or other biomolecules (ORFs y4yI and y4yO), transcriptional regulation/DNA binding (ORFs y4wC and y4xI), amino acid metabolism or metabolism of amino acid derivatives (ORFs y4uB, y4uC, y4uF, y4wD, y4wE and y4xN to y4yA), degradation of xenobiotic compounds (ORFs y4vG to y4vI), peptidolysis/proteolysis (ORFs y4wA and y4wB) or transposition (ORFs y4uE, y4uH and y4uI) (see Table 3).
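The coordinates quoted in this paragraph and in Table 3 can be cross-checked with simple arithmetic, sketched below. The helper names `span_length` and `expected_residues` are hypothetical. The sketch assumes that the positions listed for an ORF are 1-based, inclusive and include the stop codon; this is an inference that happens to agree with the rows tested here (y4aB, y4aC and y4aL), not a statement made in the table itself. ORFs that wrap around the plasmid origin are not handled.

```python
def span_length(start, end):
    """Length in bases of a 1-based, inclusive interval with start <= end."""
    if end < start:
        raise ValueError("interval wraps the origin or is reversed")
    return end - start + 1


def expected_residues(start, end):
    """Deduced amino acids for an ORF whose coordinates include the stop codon.

    Assumption of this sketch: the listed positions run from the first base
    of the start codon to the last base of the stop codon.
    """
    length = span_length(start, end)
    if length % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return length // 3 - 1  # subtract the stop codon


# Region covered by cosmids pXB296, pXB368 and pXB110, and the gap filled
# from pXB564, using the positions given in the text.
print(span_length(417796, 517279))
print(span_length(480607, 483991))

# Rows of Table 3: y4aB, y4aC and y4aL (nodD1).
print(expected_residues(523, 1776))
print(expected_residues(1776, 2615))
print(expected_residues(12380, 13348))
```

For the three ORFs shown, the computed values equal the numbers of deduced amino acids listed in Table 3 (417, 279 and 322), and the region described as 100 kb spans 99,484 bases. A check of this kind is a convenient way to catch transcription errors in individual coordinates.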
TABLE 3. List of the predicted functional ORFs and of fragments representing putative remnants of functional ORFs

Columns: ORF^(a); functional name; st.^(b); position in plasmid (base no.)^(c); no. of deduced amino acids; hom. amino acids (position); hom. protein name; hom. protein length (aa)^(d); accession no.^(e); I/%^(f); S/%^(f); note^(g)

y4aA −2/3 534696-000474 647 16-646 Shc 658 X86552 78 88 prob. squalene-hopene-cyclase; put. operon y4aABCD: inv. in synthesis of an isoprenoid compound y4aB −3 000523-001776 417 6-415 ORF1 414 X80766 43 63 put. flavoprotein oxidoreductase y4aC −2 001776-002615 279 3-247 Psy1 419 X68017 34 50 put. phytoene synthase y4aD −1 002612-003490 292 10-195 CrtI 342 L37405 33 51 hyp. protein hom. to squalene and phytoene synthetases fa1 −3 003487-004011 fragmentous character y4aF nolK −3 005173-006117 314 9-310 ORF14.8 321 U46859 51 70 put. NAD-dep. nucleotide sugar epimerase/dehydrogenase; NoeJKL/NodZ/NolK inv. in biosynthesis of fucose moiety of Nod factors y4aG noeH −2 006126-007181 351 4-339 RfbD 348 U24571 65 80 put. GDP-D-mannose dehydratase y4aH nodZ −1 007426-008394 322 3-254 NodZ 324 L22756 69 83 put. fucosyltransferase y4aI noeK −3 008623-010047 474 5-471 ORF5 483 U47057 42 59 put. phosphomannomutase y4aJ noeJ +3 010110-011648 512 33-498 XanB 466 M83231 50 65 put. mannose-1-phosphate guanylyltransferase y4aK +2 012125-012277 50 hyp. 5.5 kd protein y4aL nodD1 +2 012380-013348 322 1-322 NodD1 322 Y00059 98 99 transcriptional regulator (LysR family); high similarity to 1-310 NodD2 312 this work 68 84 Y4xH(NodD2) y4aM +3 013911-014342 143 7-132 ORF3 127 L13845 50 66 put. DNA-binding protein; high similarity to Y4wC 1-143 Y4wC 143 this work 69 77 y4aN +1 014488-014934 148 1-129 ORF3 128 X04833 41 56 homologue located nearby the replicator region of pRiA4b y4aO +3 015065-015643 192 hyp. 21.8 kd protein; low similarity to Y4nF(<30% id.) y4aP mucR +3 016161-016592 143 1-143 MucR 143 L37353 89 95 put.
transcriptional regulator (Ros/MucR family); similarity to Y4pD; possibly inv. in regulation of exopolysaccharide synthesis y4aQ −2 017016-017582 188 15-167 No1265 266 X74068 33 50 hyp. 20.4 kd protein; similar to Y4hP, Y4jD, Y4qI y4aR +2 017798-018121 107 hyp. 12.1 kd protein y4aS +1 018121-018666 181 hyp. 20 kd protein fa2 +3 018912-019664 250 126-250 Tnp 465 U04047 38 51 hyp. protein fragment 78-150 Y4iG 90 this work 93 97 3-266 Y4bF 457 this work 53 73 y4bA −2 019674-021758 694 1-393 fo6 430 this work 89 95 hyp. 78.7 kd protein; identical to Y4pH 406-532 fo5 136 this work 83 94 532-694 fo4 143 this work 77 83 y4bB −3 021748-022014 88 2-88 Y4oL 88 this work 63 69 hyp. 9.7 kd protein precursoer; identical to Y4pI y4bC −1 022034-022483 149 1-149 Y4oM 149 this work 79 88 hyp. 16.8 kd protein; identical to Y4pJ y4bD −2 022674-022943 89 20-89 Y4oN 70 this work 73 84 hyp. 10.2 kd protein; identical to Y4pK fb1 +2 022985-023659 224 36-224 Y4bF 457 this work 42 63 hyp. protein fragment y4bF +1 023953-025326 457 130-436 Tnp 465 U04047 31 46 put. transposase; 2-265 Fa2 266 this work 53 73 upstream of this ORF (23875-23987) 89% nt-id. to part 77-169 Y4iG 90 this work 51 72 of origin of replication-region (R. meliloti, S66221) 285-457 Fb1 188 this work 42 63 410-457 Y4jM 70 this work 75 79 y4bG +1 025870-026685 271 hyp. 30 kd protein precurser y4bH +1 028513-028788 91 hyp. 9.6 kd integral membrane protein y4bI +3 028860-029276 138 3-108 HI1631 190 U00085 41 61 hyp. 15.3 kd protein precurser y4bJ +1 029392-031284 630 429-564 HtrA 503 L20127 40 53 hyp. 67.9 kd integral membrane protein, distantly related to peptidase family S2C y4bK +2 031625-032293 222 83-212 ORF1 215 D84146 25 45 hyp. 24.3 kd protein y4bL +1 032641-034191 516 7-515 ORF1 558 X79443 44 63 identical to Y4kJ and Y4tB; similar to Fo3 and Fo7; put. 6-516 Y4uI 515 this work 48 66 transpposase y4bM +3 034188-034979 263 1-203 ORF2 231 X79443 45 62 identical to Y4kI and Y4tA; put. 
insertion sequence ATP- 6-248 Y4pL 245 this work 55 73 binding protein; similarity to Y4pL, Y4uH, also to 6-254 Y4uH 248 this work 48 68 Y4sD/Y4nD/Y4iQ 1-263 Y4iQ 298 this work 31 56 y4bN +1 035278-036573 431 hyp. 47.6 kd protein y4bO +1 036646-038466 606 hyp. 66.8 kd protein y4cA −1 038576-042169 1197 hyp. 137.7 kd protein; largest protein in pNGR234a y4cB −3 042226-042522 98 hyp. 10.2 kd integral membrane protein y4cC −3 042556-044109 517 hyp. 57.8 kd protein y4cD −2 044106-046028 640 hyp. 71.6 kd protein y4cE −3 046486-047661 391 hyp. 43.4 kd protein y4cF −1 047687-048829 380 hyp. 41.8 kd protein y4cG +2 049361-050278 305 16-173 Pin 184 K00676 50 68 prob. DNA invertase “resolvase-type” 17-222 Y41S 183 this work 40 60 y4cH −2 050427-050636 69 4-65 CspS 70 L23115 56 70 prob. cold shock regulator y4cI −2 053202-054416 404 1-397 RepC 405 X04833 60 73 put. replication protein C y4cJ −3 054571-055551 326 1-317 RepB 319 X89447 39 55 put. replication protein B y4cK −2 055608-056831 407 10-404 RepA 398 X89447 58 73 put. replication protein A y4cL traI +2 057635-058261 208 1-206 TraI 212 U43675 55 66 prob. autoinducer synthetase (inv. in control of conjugal transfer) y4cM trbB +3 058272-059249 325 3-325 TrbB 323 U43675 80 88 prob. conjugal transfer protein (PulE family) 1-115 Y4oG 125 this work 25 51 y4cN trbC +1 059239-059622 127 7-127 TrbC 134 U43675 69 78 prob. conjugal transfer protein (integral membrane prot.) y4cO trbD +2 059615-059914 99 1-99 TrbD 99 U43675 70 89 prob. conjugal transfer protein (integral membrane prot.) y4cP trbEa +3 059925-060374 149 1-136 TrbE 820 U43675 80 91 prob. conjugal transfer protein (hom. to 5′ part of trbE) y4cQ trbEb +1 060394-062382 662 5-659 TrbE 820 U43675 83 90 prob. conjugal transfer protein (hom. to 3′ part of trbE) y4dA trbJ +2 062354-063157 267 1-107 TrbJ 175 U43675 60 69 prob. conjugal transfer protein 194-267 71 79 y4dB trbK +1 063154-063351 65 5-65 TrbK 75 U43675 40 56 prob. 
conjugal transfer protein precurser y4dC trbL +3 063345-064520 391 3-387 TrbL 395 U43675 74 85 prob. conjugal transfer protein (integral membrane prot.) y4dD trbF +2 064544-065206 220 1-220 TrbF 220 U43675 80 90 prob. conjugal transfer protein y4dE trbG +1 065224-066036 270 6-270 TrbG 284 U43675 74 84 prob. conjugal transfer protein precursor y4dF trbH +1 066040-066486 148 1-147 TrbH 159 U43675 55 68 prob. conjugal transfer protein precurser (with lipid anchor) y4dG trbI +3 066498-067793 431 1-430 TrbI 433 U43675 66 79 prob. conjugal transfer protein (integral membrane prot.) y4dH traR +2 068096-68806 236 7-236 TraR 234 Z15003 28 45 prob. transcriptional activator of conjugal transfer genes (LuxR family) y4dI traM −1 068810-069133 107 8-101 TraM 102 U43674 30 51 prob. modulator of TraR/autoinducer-mediated activation of tra genes y4dJ +3 069351-069584 77 1-67 ORF 84 X16458 37 59 hyp. transcriptional regulator (PbsX family); low similarity to N-terminus of Y4dL y4dK −1 069629-069949 106 hyp. 11.8 kd protein fd1 −2 069936-070250 (105) (2-85) ORFA 400 X67861 39 58 put. transposase fragment y4dL +1 070603-071193 196 — — — — — — hyp. 21.8 kd protein; low similarity to Y4dJ y4dM +2 071186-072415 409 1-357 HipA 440 M61242 31 46 hyp. 45.3 kd protein; homolog affects frequency of 3-405 Y4mE 420 this work 34 56 persistence after inhibition of cell wall or DNA synthesis y4dN +1 072787-072975 62 — — — — — — hyp. 7 kd protein y4dO −1 073550-073951 133 12-121 ORF 381 D83536 43 57 hyp. 14.9 kd (fragmentous?) protein; homology to intron protein of P. anserina continues in fr.-2 (73541-73467) y4dP −1 074423-075025 200 1-48 ORFR2 57 U43674 72 89 hyp. 21 kd protein; hom. to conjugal transfer region 1 56-198 ORFR3 154 47 71 y4dQ traB −2 075042-076205 387 1-387 TraB 421 U40389 61 72 prob. conjugal transfer protein y4dR traF −3 076195-076761 188 20-188 TraF 176 U40389 55 73 prob. conjugal transfer protein y4dS traA −2 076758-080066 1102 1-1102 TraA 1100 U43674 67 79 prob. 
conjugal transfer protein (relaxase) y4dT traC +3 080319-080627 102 1-102 TraC 98 U40389 64 80 prob. conjugal transfer protein y4dU traD +1 080632-080847 71 1-71 TraD 71 U43674 77 84 prob. conjugal transfer protein y4dV traG +2 080834-082756 640 1-631 TraG 658 U40389 71 83 prob. conjugal transfer protein fd2 + 083002-083293 ORFL1 152 U43674 fragments hom. to ORFL1 (conjugal transfer region1); frameshifts: 83072 (1 > 3), 83161 (3 > 2) y4dW +1 083305-083919 204 hypothetical 22.9 kd protein y4dX +1 083944-84522 192 hypothetical 20.6 kd protein y4eA −2 084570-084836 88 hypothetical 9.9 kd protein y4eB −3 084976-085290 104 hypothetical 11.6 kd protein fe1 − 085829-088007 MerA 474 X65467 put. fragments; homology to mercuric reductase, put. frameshifts: 86592 (−1 < −3), 87288 (−3 < −2) y4eC −2 088305-089228 307 14-306 TraC-1 1061 X59793 38 55 hyp. 34.2 kd protein; hom. to 5′end. of traC-1 from plasmid RP4 y4eD +1 091051-092178 375 51-136 ORF145 145 X52594 29 55 put. phosphodiesterase; low homology to glycerophosphoryl-diester-phosphodiesterase y4eE +1 092212-093288 358 hyp. 38.5 kd protein fe2 − 093572-093969 TnpA U14952 fragments of put. transposase; put. frameshift: 93798 (2 < 3) y4eF −1 093980-094735 251 2-236 Int 259 U14952 37 53 put. integrase/recombinase (“phage-type”); similar to 1-251 Y4qK 308 this work 92 94 Y4rF (35% aa-id.); low similarity to Y4rABCDE fe5 −1 094988-095188 66 1-66 Fq6 66 this work 79 94 put. defective integrase/recombinase 1-66 Y4rC 332 this work 41 55 fe3 − 095343-096025 Int 259 U14952 fragments hom. to integrase; put. frameshift: 95559- 95671 (−2 < −1) y4eH nolL −2 096093-097193 366 11-359 NolL 373 U22899 63 77 nodulation protein; hyp. acetyl transferase y4eI −2 097914-098225 103 hyp. 11.1 kd protein with transmembrane domain fe6 +3 098358-098657 99 3-98 AatB 410 L12149 40 55 hyp. 10.3 kd protein fragment, hom. to C-terminal part of bacterial aminotransferases y4eK +2 098675-099421 248 10-245 Adh 252 U00084 37 53 hyp. 
short chain type dehydrogenase/reductase y4eL +3 099447-100193 248 1-244 Gno 256 X80019 31 47 hyp. short chain type dehydrogenase/reductase fe4 + 100270-101901 IIvG M37337 put. fragment; put. frameshifts: 100721 (1 > 2), 101728 (2 > 1) fe7 −1 101585-102298 237 1-103 Tnp 398 U08627 91 95 put. truncated transposase-like protein; similar to Y4pO y4eN −3 102625-102936 103 hyp. 11.5 kd protein y4eO −2 102933-103598 221 hyp. 24.5 kd protein y4fA −1 103805-106342 845 327-837 McpA 657 X66502 41 59 prob. methyl-accepting chemotaxis protein 7-845 Y4sI 756 this work 29 49 y4fB +3 106620-108614 664 hyp. 73.7 kd protein y4fC +3 109884-110618 244 10-163 DszA 453 L37363 38 52 hyp. (fragmentous?) monooxygenase; extended homology to DszA in fr.2: 110372 to 110506. y4fD −1 110516-111178 220 hyp. 24.6 kd integral membrane protein y4fE −2 111195-111677 160 hyp. 17.2 kd protein precursor y4fF −1 111803-112348 181 hyp. 19.5 kd protein y4fG −2 112338-112727 129 hyp. 14.5 kd protein y4fH −1 113474-113782 102 hyp. 11.6 kd protein ff1 −3 113779-114114 111 61-97 DppF 330 L08399 56 86 hyp. protein fragment, similar to central region of oligo/di-peptide ABC transporter ATP-binding proteins y4fJ −2 114348-115379 343 3-210 RopA 318 M69214 53 66 put. outer membrane protein (porin) precurser y4fK −2 116112-117395 427 275-421 XylS2 157 L02642 31 53 put. transcriptional regulator (AraC family) y4fL −3 117385-118212 275 9-243 ORF 268 U39059 32 46 hyp. 29.1 kd integral membrane protein, belongs to the inositol monophosphatase family y4fM −2 118209-119144 311 hyp. 35.5 kd protein y4fN −2 119145-120854 569 11-513 CysU 550 U32807 23 45 prob. ABC transporter permease protein; put. part of binding-protein-dependent transport system Y4fNOP y4fO −1 120851-121870 339 12-247 PotA 381 U32759 49 68 prob. ABC transporter ATP-binding protein y4fP −1 121883-122959 358 32-293 SufA 338 M33815 23 42 prob. ABC transporter periplasmic binding protein precurser y4fQ +1 123016-124194 392 9-234 NagC 406 X14135 25 46 hyp. 
41.6 kd protein; belongs to “ROK” family (transcriptional regulator or transferase) y4fR +1 124813-126453 546 88-539 IpaH 532 M32063 38 54 hyp. 60.5 kd protein, hom. to invasion plasmid antigen H y4gA −1 126806-127369 187 hyp. 20.9 kd protein; low similarity to Y4rE y4gB −2 127485-127904 139 hyp. 16.1 kd protein y4gC −1 127901-128479 192 1-178 ORF2 415 L34580 43 58 put. integrase/recombinase (“phage-type) y4gD −1 128579-128857 92 hyp. 10.5 kd protein y4gE +2 131021-131767 248 hyp. (fragmentous?) 27.7 kd protein; put. frameshifts: 131532 (2 > 1), 131892 (1 > 2) y4gF +2 132734-133786 350 4-345 RhsB 353 U51197 65 74 prob. dTDP-D-glucose-4,6-dehydratase (Y4gFGH inv. in dTDP-L-rhamnose biosynthesis) y4gG +2 133790-134680 296 1-290 RhsD 288 U51197 48 66 prob. dTDP-4-dehydrorhamnose reductase y4gH +1 134677-135537 286 2-285 RfbA 293 U09876 65 82 prob. glucose-1-phosphate thymidylyltransferase y4gI +3 135534-138263 909 276-894 RfbC 1275 U36795 38 55 hyp. 102.8 kd protein (homolog is involved in O-antigen biosynthesis) y4gJ −1 138737-139315 192 hyp. 21.1 kd protein y4gK fixF +3 142026-143234 402 114-184 KpsS 389 X74567 26 54 necessary for functional nitrogen fixation, hom. to 203-362 30 53 capsule polysaccharide export protein y4gL −3 143473-144060 195 24-192 RhsC 188 U51197 53 65 prob. dTDP-4-dehydrorhamnose-3,5-epimerase (inv. in dTDP-L-rhamnose biosynthesis) y4gM −2 144147-145907 586 26-581 MsbA 582 Z11796 32 56 prob. ABC transporter ATP-binding protein y4gN +2 146075-147226 383 52-297 VirA 304 L08012 29 46 hyp. 45 kd protein y4hA −1 147455-148558 367 7-362 ChaA 366 L28709 34 58 put. ionic transporter y4hB noeE −3 148819-150078 419 3-138 F42G9.8 359 U00051 32 49 nodulation protein (put. sulfate transferase) 197-289 25 50 y4hC noeG −3 151051-151782 243 18-229 u0002kb 243 U00024 27 42 nodulation protein (unknown function) y4hD nolO −1 151979-154021 680 1-126 NolN 127 L22756 70 83 inv. in O-carbamoylation of Nod factors (sim. 
to NodU) 140-496 NolO 358 78 89 y4hE nodJ −3 154120-154908 262 5-261 NodJ 262 J03685 69 84 prob. ABC transporter permease (see nodI) y4hF nodI −3 154912-155943 343 15-343 NodI 339 X55795 69 85 prob. ABC transporter ATP-binding transport protein; put. role: together with NodJ export of modified beta-1,4- N-glucosamine oligosaccharides y4hG nodC −1 156095-157336 413 1-413 NodC 413 X73362 99 100 N-acetylglucosaminyltransferase y4hH nodE −3 157351-157998 215 1-215 NodB 214 X73362 99 99 chitooligosaccharide deacytelase y4hI nodA −2 157995-158585 196 1-196 NodA 196 X73362 100 100 N-acyltransferase; nodABC involved in synthesis of backbone of modified N-acylated glucosamine oligosaccharides y4hJ −1 158993-159775 260 59-240 ORF2 251 L133618 68 81 hom. to part of coproporphyrinogen III oxidase (lacks C- terminus and conserved N-term. domain) y4hK +3 160722-161465 247 hyp. 25.4 kd integral membrane protein y4hL +1 161569-161826 85 hyp. 9.6 kd protein y4hM +1 163042-164253 403 53-169 Gfor 439 M97379 31 54 hyp. 43.9 kd protein (partially hom. to glucose-fructose oxidoreductase) y4hN +2 164600-165034 144 10-144 ORFA 135 X84099 38 53 hyp. 16 kd protein; partially hom. to Y4jB and Y4rG y4hO +1 165037-165384 115 1-115 ORF140 140 X74068 100 100 hyp. 12.8 kd protein 1-115 ORFC 144 X84099 54 69 1-115 Y4jC 117 this work 36 62 y4hP +1 165430-167088 552 1-215 no1265 266 X74068 97 97 hyp. 61.7 kd protein; similar to Y4aQ, Y4jD and Y4qI 80-328 ORF2 258 M10204 67 79 362-492 ORF3 163 M10204 47 61 y4hQ +3 167091-167675 194 5-185 ORF3 237 X51418 35 53 hyp. 21.7 kd protein 1-52 ORF91 >91 X74068 96 98 y4hR −3 167710-167934 74 fi1 − 168208-168300 hyp. transposase fragment similar to R. meliloti ISRm2011-2 fi2 +1 168430-168792 120 1-130 Y4iO 252 this work 78 87 put. defective transposase (homologous to N-terminal 1-108 Y4rJ 396 this work 74 87 parts of Y4iO and Y4rJ) fi3 +2 168798-169190 130 1-109 ORF1A 317 M33159 37 55 put. defective transposase (hom. 
to C-terminal parts of 1-130 Y4iO 252 this work 78 87 Y4iO and Y4rJ); additionally weak homology to 1-130 Y4rJ 396 this work 76 84 Y4pF/Y4sB and Y4qE (<30% identity) y4iR −3 169231-169716 161 15-145 PsiB 134 L26581 55 74 hyp. protein (homolog located in a polysaccharide biosynthesis inhibition operon y4iC −2 169929-110621 230 58-123 ORF 161 Z73419 41 54 hyp. 25.8 kd protein (ORF = MTCY373.06) y4iD −3 170563-172551 662 137-342 ORF 495 Z73101 40 59 prob. monooxygenase (ORF = MTCY31.20) 418-605 28 51 y4iE +3 173295-173702 135 1-135 Y4rL 155 this work 33 52 hyp. 15.4 kd (fragmentous?) protein; similar to Y4zA y4iF −3 174211-175128 305 hyp. 34.1 kd protein y4iG −2 175590-175862 90 1-73 Y4aT 266 this work 93 97 hyp. 10.5 kd (fragmentous?) protein 1-73 Y4bF 457 this work 60 76 y4iH +2 176045-176764 239 1-236 Y4jT 336 this work 32 53 hyp. 26 kd protein precurser y4iI −2 176937-179048 703 hyp. 76.2 kd integral membrane protein y4iJ −2 179097-180887 596 hyp. 65.5 kd protein; ;ow similarity to Y4iM y4iK −3 180940-181638 232 hyp. 26.8 kd protein; y4iKL: two fragments of one gene?; put. frameshift; 181884 (−3 < −2) y4iL −2 181692-182990 432 hyp. 47.8 kd protein; y4iKL two fragments of one gene?; put. frameshift: 181884 (−3 < −2) y4iM −2 183036-184334 432 hyp. 47.1 kd protein; low similarity to Y4iJ; y4iMN two fragments of one gene?; put. frameshift: 184440 (−2 < −3) y4iN −3 184309-184935 208 hyp. 22.1 kd protein precurser; y4iMN two fragments of one gene?; put. frameshift: 184440 (−2 < −3) y4iO −2 185679-186437 252 17-243 Tnp 334 Z48244 29 46 put. transposase or transposase-fragment; additionally 1-121 Fi2 120 this work 67 79 weak homology to Y4pF/Y4sB and Y4qE (<30% identity) 123-252 Fi3 130 this work 78 87 1-252 Y4rJ 396 this work 71 83 y4iP −1 186437-186832 131 4-163 Y4rJ 396 this work 58 80 hyp. 14.4 kd protein or fragment hom. to N-term. 
of Y4rJ y4iQ −3 187162-188058 298 13-253 IstB 265 U38187 34 56 identical to Y4nD/Y4sD; put insertion sequence ATP- 8-283 Y4bM 263 this work 31 56 binding protein; similarity to Y4bM/Y4kI/Y4tA, Y4uH 5-265 Y4uH 248 this work 31 52 and weakly to Y4pL y4jA −2 188055-189569 504 147-494 IstA 507 U38187 25 42 identical to y4nE/y4sE; hyp. 57.2 kd protein with low 395-504 Fz4 110 this work 72 85 similarity to IS21/IS408/IS1162 transposases y4jB +3 190248-190706 152 24-79 ORF1 130 U19148 46 69 hyp. 16.7 kd protein; partially similarity to Y4hN; low similarity to Y4rG y4jC +2 190703-191056 117 1-115 ORFC 144 X84099 39 58 hyp. 13.1 kd protein; see y4hO 1-117 Y4hO 115 this work 36 62 y4jD +2 191105-192640 511 89-298 ORF2 258 M10204 36 53 hyp. 56.7 kd protein; see y4hP 340-453 ORF3 163 M10204 28 49 18-183 no1265 266 X74068 32 48 y4jE +1 192637-193458 273 hypothetical (fragmentous?) 29.4 kd integral membrane protein; put. frameshift: 192996 (1 > 2; end of shifted ORF at 193183) y4jF −1 194771-196330 519 hyp. 55.4 kd integral membrane protein y4jG −3 196333-196821 162 hyp. 17.9 kd transmembrane protein y4jH −2 196818-197435 205 hyp. 23 kd protein y4jI −3 197428-197820 130 hyp. 13.6 kd protein y4jJ +1 198043-198300 85 1-85 StbC 103 L48985 67 76 put. plasmid stability protein y4jK +3 198297-198719 140 1-138 StbB 139 L48985 57 76 put. plasmid stability protein y4jL +3 199002-199664 220 hyp. 25.1 kd protein y4jM −2 199746-199958 70 11-58 Y4bF 457 this work 75 79 hyp. 8 kd protein or protein fragment 15-58 fb1 188 this work 50 64 y4jN −3 199975-200415 146 hyp. 16.3 kd protein y4jO −3 201514-202479 321 hyp. 36.1 kd protein; y4jOP; two fragments of one gene?, put. frameshift: 202550 (−3 < −1) y4jP −1 202406-203194 262 hyp. 29.5 kd protein; y4jOP: two fragments of one gene?, put. frameshift: 202550 (−3 < −1) y4jQ +2 203729-206848 1039 hyp. 115.9 kd protein y4jR +1 206860-207315 151 hyp. 17.3 kd protein y4jS +1 207316-208557 413 hyp. 
44.8 kd protein y4jT −1 208877-209887 336 17-283 Y4iH 239 this work 32 53 hyp. 36.4 kd protein precurser y4kA −3 209917-210885 322 hyp. 36.7 kd protein y4kB +1 211663-212088 141 hyp. 15.2 kd integral membrane protein fk2 −1 212111-212479 122 58-116 ORF14 104 X00493 59 76 hyp. fragment; sim. to Y4hP, Y4jD and Y4qI; additional homology to ORF14 in fr. +3/+2: 212331-212509 y4kD −1 212750-214399 549 hyp. 60.4 kd protein y4kE −1 214412-215455 347 hyp. 38 kd protein; y4kEF: two fragments of one gene?, put. frameshift: 215616 (−1 < −2) y4kF −2 215439-216743 434 hyp. 47.4 kd protein; y4kEF: two fragments of one gene?, put. frameshift: 215616 (−1 < −2) y4kG −2 216855-217064 69 hyp. 7.7 kd protein y4kH −3 217105-217488 127 hyp. 14.1 kd protein y4kI −1 217670-218461 263 — — — — — — see y4bM y4kJ −3 218458-220008 516 — — — — — — see y4bL y4kK −1 220103-221041 312 hyp. 34.9 kd protein y4kL −2 221049-222041 330 101-296 ORF300 300 U23723 39 56 hyp. 37.6 kd AAA-family ATPase protein y4kM +2 222641-222994 117 hyp. 13.1 kd protein y4kN +2 223115-223537 140 hyp. 15.7 kd protein y4kO +2 223970-224218 82 hyp. 9.2 kd protein y4kP +1 224215-224505 96 hyp. 11 kd protein y4kQ −2 224898-225326 142 hyp. (fragmentous?) 15.3 kd protein; homology to hipO fragments on the complementary strand fk1 +3 225094-225473 Z36940 fragments hom. to HipO y4kR −3 225535-225666 43 1-36 ORF6 347 M87280 55 66 hyp. 4.8 kd (fragmentous?) protein (smallest ORF predicted to be a protein); hom. to N-term. of protein in crtE-crtX intergenic region y4kS −3 225751-226656 301 1-301 ORF8 300 U12678 93 94 hyp. 33.2 kd protein y4kT −2 226653-228203 516 1-516 ORF7 516 U12678 93 94 hyp. 55.1 kd protein y4kU −3 228514-229512 332 1-332 ORF6 332 U12678 90 94 prob. geranyltranstransferase y4kV −3 229666-231009 447 92-447 CYP117 356 U12678 89 94 cytochrome P-450 BJ-4 homolog y4lA −2 231009-231845 278 1-274 ORF4 275 U12678 83 87 short-chain type dehydrogenase/reductase y4lB −3 231832-232140 102 1-58 ORF3 94 U12678 93 98 put. 
P450-system 3Fe-3S ferredoxin y4lC −2 232170-233573 467 48-428 CYP114 382 U12678 90 93 cytochrome P-450 BJ-3 homolog y4lD −1 233666-234868 400 3-400 CYP112 401 U12678 92 95 cytochrome P-450 BJ-1 homolog fl3 −2 235704-235904 66 2-54 ORF8 >207 X66124 60 71 hyp. 7.6 kd protein fragment, homology to ORF8 fragments also upstream of fl3 up to 236048 fl1 − 236796-237416 Z36981 homology to hupK/hupJ fragments (fr. −3/−2) y4lF +1 237508-238479 323 hyp. 36.1 kd protein y4lG +2 238490-238975 161 hyp. 17.4 kd protein y4lH −2 238959-239537 192 3-184 Fic 200 M28363 34 51 hyp. 22.4 kd protein; hom. to cell filamentation/division protein y4lI −2 239541-239750 69 hyp. 7.3 kd protein y4lJ −3 240358-240861 167 hyp. 18.1 kd protein fl2 − 240920-241040 X65471 fragments fo transposase (ISRm4) y4lK +1 241207-241605 132 hyp. 14.3 kd protein y4lL −2 241845-244328 827 118-816 SLR0359 1244 D63999 33 50 hyp. 91.8 kd protein (member of E. coli YegE/YhdA/YhjK/YjcC family) fl4 +1 244540-244851 103 19-103 TnpA 990 L14931 39 51 put. truncated transposase; hom. to N-term. of TnpA 28-81 F15 112 this work 94 98 (transposon Tn163); strong similarity to C-terminus of F15 y4lN +3 244848-245330 160 hyp. 18.1 kd protein y4lO −3 247156-247938 260 11-216 AvrRxv 373 L20423 36 50 hyp. 29.1 kd protein; hom. to avirulence protein; put. frameshift according to homolog: 247230-247293 (−2 < −3); end of shifted frame: 246960 fl5 +1 248290-248628 112 59-112 F14 103 this work 94 98 hyp. protein fragment; strong similarity to part of F14 fl6 +3 248814-249680 288 8-286 Tnp 988 M97297 27 49 put. fragmentous transposase; homologous to C-term. of transposase (Tn1546) y4lR +3 249696-251264 522 hyp. 56.8 kd protein y4lS +1 251407-251958 183 3-176 PaeR7IN 195 S78872 42 56 put. integrase/recombinase (“resolvase-type”) 4-181 Y4cG 305 this work 40 60 y4mA +3 251955-252380 141 hyp. 15.8 kd protein fm1 − 254694-254920 fragments hom. to xylitol-dehydrogenase y4mB +3 255450-256139 229 59-229 ORF4 212 X13583 33 53 hyp. 
24.6 kd outer membrane protein precurser y4mC +2 256811-257524 237 hyp. 26.2 kd protein precurser y4mD −1 258065-258334 89 hyp. 10 kd protein y4mE −3 259030-260292 420 6-334 HipA 440 M61242 32 46 hyp. 45.7 kd protein 2-417 Y4dM 409 this work 34 56 y4mF −2 260289-260519 76 11-47 ORF3 90 X06090 37 70 hyp. transcriptional regulator; very low similarity to phage repressor proteins y4mG +3 261174-261395 73 hyp. 7.8 kd protein y4mH −2 261747-262640 297 hyp. 33.9 kd protein y4mI −2 262698-263672 324 11-252 RbsB 296 M13169 25 49 prob. ABC transporter periplasmic binding protein precurser (transport system Y4mIJK probably transports a sugar) y4mJ −3 263716-264717 333 12-323 RbsC 321 M13169 34 55 prob. ABC transporter permease y4mK −2 264714-266207 497 8-489 RbsA 501 M13169 34 55 prob. ABC transporter ATP-binding protein y4mL −3 266218-267477 419 1-418 HI1029 425 U00079 33 58 put. permease (E. coli YiaN/YgiK family) y4mM −2 267474-269099 541 38-360 HI1028 328 U32729 33 54 put. permease (SBR family 7) y4mN −1 269096-270133 345 37-340 Tkt 655 U09256 36 54 hyp. transketolase family protein (fragmentous?); hom. to C-term. of transketolases y4mO −3 270130-270969 279 9-270 Tkt 655 U09256 36 52 hyp. transketolase family protein (fragmentous?); hom. to N-term. of transketolases y4mP −3 271000-271761 253 4-249 F09E10.3 255 U41749 41 60 put. short-chain type dehydrogenase/reductase y4mQ +1 271909-272805 298 1-289 PerR 297 U57080 48 65 hyp. transcriptional regulator (LysR family) y4nA −2 273204-275384 726 45-302 ORF 690 D14005 21 36 prob. peptidase; very low similarity to Y4qF and Y4sO 365-718 38 54 (<25% identity) y4nB nodU −3 276451-278127 558 1-558 NodU 558 X89965 100 100 inv. in 6-O-carbamoylation of Nod factors; similar to Y4hD y4nC nodS −1 278144-278794 216 1-216 NodS 216 J03686 100 100 methyltransferase inv. in Nod-factor synthesis y4nD −3 280453-281349 298 — — — — — — see Y4iQ y4nE −2 281346-282860 504 — — — — — — see Y4jA fn1 + 283238-283467 241 M26938 hom. 
to virG fragments; similar to fq3 y4nF +3 283809-284501 230 hyp. 25.4 kd protein precurser; low similarity to Y4aO (<30% id.) fn2 − 284752-284923 X79443 fragments hom. to ORF2 (IS-ATP-binding protein) from IS1162 y4nG +2 285407-286597 396 53-365 ORF4 333 U08223 31 47 put. NAD-dep. nucleotide sugar epimerase/dehydrogenase y4nH +1 286594-286947 117 5-113 MvrC 110 M62732 30 47 hyp. 12.3 kd integral membrane protein (some similarity to ethidium bromide resistance proteins) y4nI +2 286964-287326 120 hyp. 13 kd transmembrane protein y4nJ +1 287335-288852 505 80-266 BetA 548 U39940 29 44 hyp. GMC-type oxidoreductase 343-468 32 45 y4nK −2 288906-290894 662 hyp. integral membrane protein y4nL −3 290914-291984 356 14-345 ORF6 328 U47057 26 45 put. NAD-dep. nucleotide sugar epimerase/dehydrogenase y4nM −3 292003-293553 516 226-514 NoeC 307 L18897 30 52 put. permease y4oA −3 294502-296283 593 328-494 MccB 350 X57583 29 41 hyp. 65.2 kd protein; homolog inv. in production of the 4-590 Y4qC 583 this work 30 50 translation inhibitor microcin C7 y4oB +1 296572-296961 129 hyp. 14.7 kd protein y4oC +1 296965-297657 230 hyp. 26 kd protein y4oD −1 297746-298390 214 hyp. 23.5 kd protein y4oE −3 298939-299148 69 hyp. 7.4 kd protein fo1 −2 299145-299588 147 fo1 and fo2: two fragments of one put. gene; put. frameshift: 299664 (−2 < −3) fo2 −3 299578-299955 125 25-109 ORF11 344 X53264 37 63 homology to 5′part of ORF11; 1-123 Y4cM 325 this work 25 51 fo1 and fo2: two fragments of one putative gene; put. frameshift: 299664 (−2 < −3) fo3 +3 300015-300815 267 15-252 Tnp 518 L09108 40 59 fo3 and fo7: transposase-like protein interrupted by NGRIS-6 fo4 −2 300828-301259 143 1-143 Y4bA 694 this work 77 83 hyp. fragment; fo4/5/6: fragments of one gene similar to Y4bA/Y4pH fo5 −1 301274-301684 136 1-127 Y4bA 694 this work 83 94 hyp. fragment; fo4/5/6: fragments of one gene fo6 −2 301608-302900 430 1-393 Y4bA 694 this work 89 95 hyp. 
fragment; fo4/5/6: fragments of one gene y4oL −3 302890-303156 88 1-88 Y4bB 98 this work 63 69 hyp. 9.6 kd protein y4oM −1 303179-303628 149 1-149 Y4bC 149 this work 79 88 hyp. 16.8 kd protein y4oN −2 303810-304022 70 1-70 Y4bD 89 this work 73 84 hyp. 8.1 kd protein fo7 +2 304118-304453 111 4-103 Tnp 518 L09108 40 59 fo3 and fo7: transposase-like protein interrupted by NGRIS-6 y4oP +1 304861-306156 431 47-429 u1756v 469 U15180 27 42 prob. ABC transporter binding protein (Y4oPQRS: sugar- like transport system) y4oQ +2 306236-307165 309 31-301 MalF 310 U15180 35 56 prob. ABC transporter permease protein y4oR +2 307178-308011 277 12-277 MalG 296 U15180 30 52 prob. ABC transporter permease protein y4oS +1 308008-309123 371 7-369 UgpC 369 U00039 50 68 prob. ABC transporter ATP-binding protein y4oT −2 309132-309722 196 2-196 Y4pA 609 this work 28 50 hyp. 20.6 kd protein; homologous to N-terminus of Y4aA, and weakly to Y4oV y4oU +1 309853-311061 402 hyp. 43.1 kd protein precurser y4oV +2 311051-311908 285 3-280 Y4pA 609 this work 32 56 hyp. 30.2 kd protein; homologous to N-terminus of Y4pA, and weakly to Y4oT y4oW +1 311911-312561 216 hyp. 23.7 kd protein y4oX +3 312606-313688 360 36-233 MocA 317 X78503 29 44 prob. NAD-dep. oxidoreductase y4pA +1 313714-315543 609 310-596 HydG 441 U00006 33 50 put. transcriptional regulator (sigma54-dep.) 6-290 Y4oV 285 this work 32 56 35-237 Y4oT 196 this work 28 50 y4pB otsB +3 316350-317147 265 30-260 OtsB 266 X69160 41 57 prob. trehalose-6-phosphate phosphatase y4pC otsA +1 317185-318579 464 1-456 OtsA 474 X69160 46 66 prob. trehalose-6-phosphate synthase; similar to fq1/2 fp1 + 318915-319242 U08864 fragments homologous to ORF3; put. frameshift acc. to homologue: 319122 (3 > 1) fp2 + 319236-319670 U08864 fragment homologous to ORF1 from IS1248 (fr. 3); similar to fs4 y4pD −1 319601-320116 171 13-140 Ros 142 M65201 50 71 put. 
transcriptional regulator (MucR family); missing Zn finger motif; similar to Y4aP y4pE −1 320606-321013 135 1-135 222 U18764 91 94 identical to y4sA; hyp. 15.5 kd protein hom. to N-term. of RFRS9 25kDa protein y4pF −2 321297-322460 387 50-374 Tnp 334 Z48244 43 60 identical to y4sB; put. transposase; low similarity to Y4qE, Y4iB and Y4iO (<30% aa-id.) y4pG −3 322486-323064 192 1-191 ORFA 197 U22323 47 64 identical to y4sC; hyp. 21.1 kd protein fp3 +2 323189-323956 X79443 “ORF” homologous to ORF1 of IS1162 interrupted by stop codon (323444) y4pH −1 323969-326053 694 — — — — — — see y4bA y4pI −2 326043-326309 88 — — — — — — see y4bB y4pJ −3 326329-326778 149 — — — — — — see y4bC y4pK −1 326969-327238 89 — — — — — — see y4bD fp4 +1 327277-328059 L09108 48 65 fragment homologous to put. IS-ATP-binding protein y4pL +3 328071-328808 245 1-204 ORF2 231 X79443 51 63 put. insertion sequence ATP-binding protein; similarity to 1-242 Y4bM 263 this work 55 73 Y4bM/Y4I/Y4tA, Y4uH, and weakly to 1-245 Y4uH 248 this work 61 77 Y4iQ/Y4nD/Y4sD (<30 aa-id.) y4pM +2 329159-329977 272 hyp. 30.9 kd protein fp5 − 330657-331414 put. frameshift: 331032 (2 < 1) y4pN syrM1 −3 332506-333522 338 13-324 SyrM 326 M33495 63 77 probable symbiotic regulator (LysR family) 1-338 SyrM2 339 this work 62 79 y4pO +1 335062-336264 400 1-400 Tnp 400 M60971 96 98 prob. transposase (Mutator family); similarity to fe7 fq2 −2 333987-335003 338 1-320 OtsA 474 X69160 44 61 join fq1 + fq2: hom. to trehalose-6-phosphate synthase interrupted by ISRm3-like element NGRIS-8; similarity to Y4pC (45% aa-id.) fq1 −1 336311-336694 128 44-174 OtsA 474 X69160 48 67 see fq2 fq3 + 337338-338056 M26938 virG homologous fragments: stop at 37380; put. frameshift at 337844 (3 > 2); similar to fn1 y4qB −1 339053-339547 164 hyp. 18.8 kd protein y4qC −3 339535-341286 583 314-489 ORF 401 Z54354 28 46 hyp. 63.6 kd protein 1-583 Y4oA 593 this work 30 50 y4qD −3 343216-343950 244 1-244 Y4rO 618 this work 55 74 hyp. 
26.8 kd protein, similar to N-terminus of Y4rO y4qE +2 344114-345286 390 37-380 Tnp 364 X77623 38 57 prob. transposase; low similarity to Y4pF/Y4sB, Y4iB, Y4iO and Y4rJ (<30% aa-id.) fq4 +3 345798-346130 M38257 34 51 fragments homologous to XerC (integrase) y4qF −2 346215-348479 754 41-725 PtrII 707 D10976 31 49 prob. peptidase (S9A family); high similarity to Y4sO; 32-736 Y4sO 705 this work 70 84 low similarity to Y4nA (<25% id.) y4qG −2 348501-349847 448 40-389 YgiG 454 U32722 42 62 prob. aminotransferase (class 3) y4qH −1 350294-351274 326 144-326 LasR 239 M59425 37 51 hyp. transcriptional regulator (LuxR family) y4qI −2 351837-353456 539 146-419 ORF1 322 M25805 44 63 hyp. 59.7 kd protein; similar to Y4aQ, Y4hP, Y4jD fq5 −3 353533-353775 fragments fq5 and fr3 represent one put. gene similar to Y4hO and Y4jC interrupted by IS elements y4qJ −1 354140-355336 398 7-395 TnpA 388 U14952 42 60 put. transposase y4qK −2 355344-356270 308 51-293 Int 259 U14952 39 55 put. integrase/recombinase (“phage-type”); similar to 51-308 Y4eF 251 this work 92 94 Y4rF; low similarity to Y4rABCDE fq6 −2 356436-356636 66 1-66 Fe5 66 this work 79 94 put. defective integrase/recombinase (“phage-type”); 75% Y4rC 332 this work 45 62 nt-identity; 356436-356710 and 94988-95262 [R-20] y4rA +1 356803-358032 409 17-397 ORF2 415 L34580 39 55 put. integrase/recombinase (“phage-type”) y4rB +3 358029-358973 314 135-267 TnpI 284 X07651 30 51 put. integrase/recombinase (“phage-type”) y4rC +2 358970-359968 332 22-294 XerC 295 U32696 31 50 put. integrase/recombinase (“phage-type”) 267-332 Fe5 66 this work 41 55 267-332 Fq6 66 this work 45 62 y4rD −3 360025-360870 281 15-277 XprB 298 M54884 25 46 put. integrase/recombinase (“phage-type”) y4rE −2 360867-361799 310 50-288 YqkM 296 D84432 27 48 put. integrase/recombinase (“phage-type”); low similarity to Y4gA y4rF −1 361796-363073 425 126-414 ORF2 415 L34580 34 49 put. 
integrase/recombinase (“phage-type”) y4rG −1 363287-363694 135 16-109 ORF1 130 U19148 32 48 hyp. 14.8 kd protein (IS866 family); low similarity to Y4jB, Y4hN y4rH −3 363895-365331 478 62-374 Bcp 598 X63470 26 44 put. ligase; hom. to biotin carboxylases fr1 −3 366307-366669 85% aa-identity to part of Y4rL fr2 − 366594-367402 put. frameshift: 367296 (−2 < −1) fr3 −3 367705-367827 hom. to N-term. of Y4hO; see fq5 y4rI −3 368503-369675 390 hyp. 44 kd protein y4rJ +1 369697-370887 396 152-379 Tnp 339 M80806 28 45 put. transposase; low similarity to Y4qE (<30% aa-id.) 135-244 Y4iA 120 this work 74 87 266-396 Y4iB 130 this work 76 84 135-396 Y4iO 252 this work 71 83 2-131 Y4iP 131 this work 58 80 y4rK −1 370976-371350 124 hyp. 14.5 kd protein y4rL −2 371454-371921 155 1-99 Y4zA 295 this work 99 99 hyp. 17.7 kd protein; y4rLM: two fragments of one 17-155 Y4iE 135 this work 33 52 gene?; put. frameshift: 371972 (−2 < −3); 85-99% aa- identity to parts of Y4zA and fr1 y4rM −3 371938-372990 350 258-339 Y4zA 295 this work 98 98 hyp. 39.4 kd protein; see y4rL y4rN −2 373578-374795 405 35-368 P43 416 X57470 26 44 hyp. 41.6 kd integral membrane protein y4rO +1 375313-377169 618 274-596 H1N0578 366 U32742 25 45 hyp. 69.3 kd protein; N-terminus: hom. to Y4qD; C- 1-244 Y4qD 244 this work 55 74 terminus: hom. to C-terminus of histidinol-1-phosphate transaminase fr4 + 377185-377534 X66016 sim. to Y4rG; put. frameshift: 377376 (1 > 3); hom. to fragment of ORFA3 (377409-377540) y4sA −3 377842-378249 135 — — — — — — seey4pE y4sB −1 378533-379696 387 — — — — — — see y4pF y4sC −2 379722-380300 192 — — — — — — see y4pG y4sD −1 380933-381829 298 — — — — — — see y4iQ y4sE −3 381826-383340 504 — — — — — — see y4jA fs5 −3 383593-384054 153 8-150 Tnp 334 Z48244 48 65 put. defective transposase; sim. to fs1 384210-384493 fragments with 94-84% nt-id. to ISRm6 (R. meliloti; acc. no. X95567) y4sG +1 384808-385818 336 97-325 Ddl 306 M14029 34 57 hom. 
to D-alanine:D-alanine ligase; probably different function y4sH +3 386505-387890 461 267-337 CapA 411 M24150 42 63 hom. to encapsulation protein A; nearly identical to Y4aA fs1 − 388138-388586 Tnp Z48244 fragments of put. transposase; put. frameshift: 388452 (−3 < −2); sim. to Y4pF, Y4sB, fs5 fs2 +2 388697-388897 ORF1 U19148 43 62 put. transposase fragment; hom. to N-term. of ORF1; sim. to Y4jB, Y4rG, Y4hN fs3 + 388966-390695 AtoC U17902 put. transcriptional regulator fragment (put. frameshifts: 389891 (1 > 2); 390170 (2 > 3)); sim. to Y4pA, Y4oV, Y4oT) y4sI +2 390971-393241 756 325-741 McpA 657 X66502 41 60 prob. methyl-accepting chemotaxis protein 1-749 Y4fA 845 this work 29 49 y4sJ gapD −3 393202-394677 491 29-489 GabD 482 M88334 58 75 prob. succinate-semialdehyde dehydrogenase y4sK −1 394790-395170 126 5-122 C23G10.2 185 U39851 55 71 bel. to the YER057C/YIL051C/YJGF family; probably important cellular function y4sL −1 395204-395815 203 2-203 DadA 432 L02948 57 74 either functional dehydrogenase or non-functional fragment; hom. to small subunit of D-aminoacid dehydrogenase y4sM +1 395935-396318 127 1-127 ORF1 127 X74314 99 99 put. transcriptional regulator (AsnC/Lrp family; low homology to y4tD); missing H-T-H region y4sN +1 396523-396900 125 1-123 ORF2 >123 X74314 98 98 similar to ORFs derived from insertion elements (IS6501 family); low similarity to fu4 fs4 + 396855-397283 (143 8-141 ORF1 186 X53945 48 63 put. IS-derived protein fragment (homology to C-term. of 1-141 Fp2 145 this work 39 62 ORF1 from IS869) y4sO −2 397608-399725 705 10-694 PtrII 706 D10976 32 49 prob. peptidase (S9A family); low similarity to Y4nA 1-705 Y4qF 754 this work 70 84 (<25% id.) ft1 +3 400377-400625 (83) 20-83 Y4tE 300 this work 64 78 ft1 and ft2: one put. gene encoding an amino acid ABC transporter binding protein interrupted by NGRIS-3c. 
y4tA −3 400732-401523 263 — — — — — — see y4bM y4tB −2 401520-403070 516 — — — — — — see y4bL ft2 +1 403249-403899 (216) 5-195 ArgT 260 V01368 25 48 see ft1 2-215 Y4tE 300 this work 76 86 y4tD +1 404182-404691 169 11-161 HIN1362 168 U32817 38 64 put. transcriptional regulator (AsnC/Lrp family; but low homology to y4sM) y4tE +1 405157-406059 300 31-281 FliY 257 U32734 27 48 prob. amino acid ABC transporter binding protein 86-299 Ft2 215 this work 76 86 (periplasmic); prob. part of binding-protein-dep. transport system Y4tEFGH y4tF +1 406111-406827 238 25-233 YckJ 234 X77636 35 54 prob. amino acid ABC transporter permease protein y4tG +3 406830-407525 231 1-220 GlnP 226 D30762 32 54 prob. amino acid ABC transporter permease protein y4tH +2 407522-408295 257 5-256 GlnQ 242 M61017 52 71 prob. amino acid ABC transporter ATP-binding protein y4tI +1 408745-409953 402 22-391 Slr0072 393 D64004 35 54 put. peptidase (M40 family) y4tJ +1 409990-410988 332 7-328 Thd2 329 M21312 35 57 put. threonine dehydratase y4tK +3 410988-411983 331 69-326 ArcB 351 U39262 30 44 hyp. cyclodeaminase; (sim. to ornithine cyclodeaminase) y4tL +2 412118-413290 390 10-384 ORF 411 D14463 27 45 hyp. hydrolase/peptidase (M24 family) 1-389 Y4tM 392 this work 34 53 y4tM +2 413453-414631 392 17-390 PepQ 368 Z34896 24 43 put. hydrolase/peptidase (M24 family) 1-390 Y4tL 390 this work 34 53 y4tN +1 414655-415179 174 hyp. 19.6 kd protein y4tO +1 415252-416847 531 1-484 OppA 543 M60918 28 46 prob. peptide ABC transporter binding protein precursor; prob. part of a binding-protein-dependent transport system Y4tOPQRS y4tP +2 416852-417793 313 4-313 DppB 339 L08399 36 58 prob. peptide ABC transporter permease protein y4tQ +1 417796-418671 291 9-287 AppC 303 U20909 36 56 prob. peptide ABC transporter permease protein; 418611; C or T possible! y4tR +2 418673-419680 335 12-327 OppD 336 X56347 50 68 prob. peptide ABC transporter ATP-binding protein y4tS +1 419677-420738 353 3-320 AppF 329 U20909 49 69 prob.
peptide ABC transporter ATP-binding protein y4uA +3 420774-422159 461 267-337 CapA 411 M24150 42 63 put. cell wall compound biosynthesis protein; almost identical to Y4sH y4uB +3 422628-424031 467 1-464 BioA 448 U51868 33 57 prob. aminotransferase (class 3) y4uC +3 424056-425594 512 58-509 GabD 482 M88334 33 52 prob. aldehyde dehydrogenase fu1 +2 425699-425779 N15K 238 D45911 put. protein fragment; 67% id. to N15K in 26 aa fu2 +3 425841-426083 PhbA 393 U17226 fragment 65% identical to C-term. of beta-keto-thiolase y4uD +1 426010-426507 165 hyp. 18.7 kd protein y4uE −3 426949-428028 359 78-290 Tnp 414 X15942 31 45 put. transposase (IS110 faminly); put. frameshift: between 427040 and 427180 (−2 < −3; end of shifted ORF: 426699) y4uF +3 428292-429623 443 13-440 GLUD1 558 X07674 42 60 prob. glutamate dehydrogenase fu3 + 429860-430007 Tnp 398 U08627 put. transposase fragment (92% id. in 16 aa); 85% nt- identity to 3′term. part of ISRm5 y4uG +1 430105-430320 71 hyp. 7.8 kd protein y4uH −1 430538-431284 248 1-202 ORF2 231 X79443 48 63 put. insertion sequence ATP-binding protein; similarity to 1-245 Y4pL 245 this work 61 77 Y4pL, Y4bM/Y4kI/Y4tA and Y4iQ/Y4nD/Y4sD 1-248 Y4bM 263 this work 48 68 (IS21/IS1162 family) 4-248 Y4iQ 298 this work 31 52 y4uI −3 431296-432840 514 1-514 Tnp 518 L09108 44 63 put. transposase; similiarity to Y4bL/Y4J/Y4tB (IS21/IS1162 family) fu4 − 433222-433560 Tnp 201 X65471 put. tranposase fragments (74-92% id. in 88 aa); 79% nt- identity to 5′term. of ISRm4 y4uJ fixU −1 433880-434110 76 1-70 FixU 70 X51963 63 80 hyp. 8.5 kd protein y4uK nifZ −3 434107-434433 108 6-79 ORF2 >78 X07567 52 78 put. nitrogen fixation NifZ protein y4uL fdxN −2 434517-434711 64 1-64 FdxN 64 M21841 79 84 prob. 4Fe4S ferredoxin y4uM nifB −1 434753-436234 493 1-493 NifB 490 M15544 72 81 involved in FeMo cofactor biosynthesis y4uN nifA −1 436460-438244 594 37-594 NifA 584 U31630 62 74 positive regulator of nif, fix, and additional genes (sigma54-dep.) 
y4uO fixX −2 438297-438590 97 2-97 FixX 98 M15546 84 89 prob. 3Fe-3S ferredoxin inv. in nitrogen fixation y4uP fixC −1 438605-439912 435 1-435 FixC 435 M15546 82 89 required for nitrogenase acitivity y4vA fixB −2 439923-441032 369 18-363 FixB 353 M15546 79 87 putatively inv. in a redox process in nitrogen fixation y4vB fixA −2 441042-441899 285 1-280 FixA 292 M15546 75 90 putatively inv. in a redox process in nitrogen fixation fv1 −1 442181-442252 NifS 384 X68444 put. NifS fragment (70% identity in 24 aa) y4vC −1 442316-442636 106 1-106 ORF118 118 X13691 54 72 hyp. 11 kd protein (HesB/YadR/YfhF family); homologues located upstream of nifS y4vD −2 443313-443879 188 5-173 HIN1693 241 U32848 46 60 put. redox enzyme (hom. to glutaredoxin-like membrane protein and peroxysomal membrane proteins) y4vE nifQ +1 444337-445029 230 56-212 NifQ 180 M26323 39 56 putatively involved in Mo cofactor processing y4vF dctA1 +2 445088-446602 504 1-443 DctA1 456 S38912 99 99 C₄-dicarboxylate transport protein; nt-deletion at 446416 in comparison to sequence of acc. no. S38912 causing a frameshift (DctA1 is 48 aa longer than DctA1 in S38912) y4vG +1 446599-447843 414 13-413 CamC 415 M12546 34 50 prob. cytochromeP450 y4vH +1 447844-448500 218 (32-157 LinA 155 D90355 28 46) hyp. 24.6 kd protein (with very weak homology to gamma-hexachlorocyclohexane-dechlorinase) y4vI +3 448557-450203 548 9-250 FabG 244 U39441 38 56 short-chain type dehydrogenase/reductase 276-513 30 48 y4vJ +2 450341-451396 351 1-188 LuxA 357 M36597 27 47 put. 
monooxygenase; similar to Y4wF; y4vK nifH1 +1 451993-452883 296 1-296 NifH 296 M26961 99 99 Fe protein of nitrogenase y4vL nifD1 +1 452980-454494 504 199-393 NifD >195 M26962 98 99 alpha-subunit of MoFe protein of nitrogenase y4vM nifK1 +3 454590-456131 513 132-195 NifK >64 M26963 100 100 beta-subunit of MoFe protein of nitrogenase y4vN nifE +1 456187-457677 496 1-469 NifE 547 X56894 62 78 involved in FeMo cofactor biosynthesis y4vO nifN +1 457687-459096 469 1-455 NifN 441 M18272 70 81 involved in FeMo cofactor biosynthesis y4vP nifX +3 459093-459575 160 22-156 NifX 159 X17433 52 68 nitrogen fixation protein y4vQ +3 459579-460067 162 22-162 ORF4 156 X17433 49 70 hyp. 17.7 kd protein, similar to proteins of other 1-162 Y4xD 162 this work 61 75 nitrogen-fixing bacteria and to Y4xD y4vR +1 460501-460920 139 1-58 NifH 296 M26961 50 63 similar to N-term. of Fe protein of nitrogenase y4vS fdxB +2 461228-461545 105 1-88 ORF5 102 M26323 52 65 prob. 4Fe-4S ferredoxin y4wA +1 463201-464739 512 86-499 PqqE 709 L43135 50 70 hyp. zinc protease (M16 family); sim. to Y4wB y4wB +3 464736-466079 447 236-438 PqqF 213 L43135 42 61 put. protease (lacks Zn-binding site; M16 family); sim. to Y4wA y4wC +3 466590-467021 143 8-132 ORF3 127 L13845 48 66 put. DNA-binding protein; high similarity to Y4aM 1-143 Y4aM 143 this work 69 77 y4wD +1 467758-468891 377 11-370 MosC 407 U23753 29 48 permease-type protein; hom. to membrane protein from the rhizopine biosynthesis (mosABC) gene cluster y4wE +3 469311-470417 368 20-361 His1 356 D14440 32 53 prob. aminotransferase (class 2) y4wF +1 470824-471852 342 40-194 LuxA 354 X06758 27 54 put. monooxygenase; sim. to Y4vJ y4wG +2 471890-472435 181 hyp. 19.4 kd protein y4wH +3 473343-473780 145 1-145 ORF2 145 M19352 64 76 hyp. 15.6 kd protein y4wI −2 473928-475469 513 hyp. 59 kd protein y4wJ −2 475503-475880 125 hyp.
13.3 kd protein y4wK nifW −1 476519-476971 150 12-118 NifW 108 M86823 50 63 NifW protein homolog; required for full activity of FeMo protein y4wL nifS −2 477135-478298 387 4-387 NifS 402 M17349 58 73 prob. NifS protein (member of class-5 pyridoxal-phosphate-dep. aminotransferase family) y4wM −2 479145-481136 663 225-620 YejA >409 U00008 38 55 put. ABC transporter binding protein (transporter or enzymatic function) fw1 −1 481460-481834 124 1-116 DctA 441 M26531 55 61 hyp. truncated transporter-like protein; hom. to N-term. of DctA (see y4vF); two frameshifts acc. to homologue: 481606 (−3 < −1); 481530 (−2 < −3; homology stops at 481419) y4wO −3 481834-482154 106 hyp. 11 kd protein y4wP +2 482540-482947 135 hyp. 14.9 kd protein y4xA nifH2 +1 483871-484761 296 1-296 NifH 296 M26961 99 99 Fe protein of nitrogenase y4xB nifD2 +1 484858-486372 504 199-393 NifD >195 M26962 98 99 alpha-subunit of MoFe protein of nitrogenase y4xC nifK2 +3 486468-488009 513 132-195 NifK >64 M26963 100 100 beta-subunit of MoFe protein of nitrogenase y4xD +3 488262-488750 162 22-162 ORF4 156 X17433 47 73 hyp. 18 kd protein; similar to proteins of other nitrogen- 2-162 Y4vQ 162 this work 61 75 fixing bacteria and to Y4vQ y4xE +1 488773-488976 67 1-64 ORF1 69 X55450 40 67 hyp. 7.6 kd protein; similar to proteins of other nitrogen-fixing bacteria y4xF +3 488973-489149 58 hyp. 6.5 kd protein y4xQ +2 489281-489583 100 14-83 ExoX 98 M61751 31 52 put. exopolysaccharide production repressor (integral membrane protein) y4xG +2 490010-491527 505 hyp. 55.5 kd protein y4xH nodD2 −2 491655-492593 312 1-312 NodD2 312 L38460 99 99 transcriptional regulator (LysR family); high similarity to 1-310 NodD1 322 this work 68 83 Y4aL (NodD1) y4xI +2 494297-494977 226 1-224 PmrA 222 L13395 39 58 signal transduction-type regulator y4xJ +1 495157-496428 423 76-378 GPIV 426 J02451 27 46 hyp. protein hom. to proteins of the general secretion pathway (pulD family), sim. to Y4yD (NolW) y4xK +1 496438-497004 188 hyp.
20.6 kd protein precurser y4xL −1 497444-498460 338 hyp. 37.1 kd protein y4xM −1 498719-499933 404 23-403 ORF1 408 X59939 22 49 permease-type protein (YceE) y4xN −3 499930-501816 628 183-505 IucC 580 X76100 28 43 hyp. 71 kd protein hom. to aerobactin synthetase subunit y4xO −2 501816-502955 379 hyp. 40.9 kd protein y4xP −1 502952-503962 336 5-304 CysK 308 D26185 40 60 put. cysteine synthase y4yA −1 503963-505336 457 hyp. 49.9 kd protein; low similarity to diaminopimelate y4yB −3 505336-505800 154 hyp. 17.1 kd protein y4yC nolX −2 505950-507740 596 1-596 NolX 596 L12251 98 99 nodulation protein as in R. fredii USDA257 y4yD nolW −3 508021-508725 234 1-234 NolW 234 L12251 99 100 nodulation protein (PulD family); sim. to Y4xJ y4yE nolB +3 508881-509375 164 1-164 NolB 164 L12251 98 99 nodulation protein y4yF nolT +3 509385-510254 289 1-289 NolT 289 L12251 96 97 nodulation protein precurser (YscJ homolog; M74011) y4yG nolU +2 510251-510889 212 1-212 NolU 212 L12251 99 99 nodulation protein y4yH nolV +3 510891-511517 208 1-60 ORF4 65 L12251 100 100 homologous to two (nodulation) proteins of R. fredii 73-208 NolV 135 96 97 USDA257 (YscL homolog; M74011) y4yI hrcN +2 511514-512869 451 35-450 YscN 439 U00998 55 73 prob. ATPase involved in secretion 1-80 HrcN 450 L12251 97 97 105-450 97 98 y4yJ +1 512845-513381 178 1-178 ORF7 178 L12251 97 98 hyp. 20.4 kd protein y4yK hrcQ +1 513406-514482 358 171-350 YscQ 307 L25667 27 46 prob. translocation protein inv. in secretion processes 1-358 HrcQ 382 L12251 96 98 (FliN/MopA/SpaO family) y4yL hrcR +2 514475-515143 222 6-216 YscR 217 L25667 46 66 prob. translocation protein inv. in secretion processes 1-222 HrcR 249 L12251 99 99 (Flip/MopC/SpaP family) y4yM hrcS +1 515143-515418 91 1-66 YscS 88 L25667 34 65 prob. translocation protein inv. in secretion processes 1-91 HrcS 92 L12251 98 100 (FliQ/MopD/SpaQ family) y4yN hrcT +3 515427-516245 272 28-250 YscT 261 L25667 31 52 prob. translocation protein inv. 
in secretion processes 1-272 HrcT 272 L12251 98 99 (FliR/MopE/SpaR family) y4yO hrcU +2 516242-517279 345 5-339 YscU 354 L25667 30 50 prob. translocation protein inv. in secretion processes 1-340 HrcU 351 L12251 99 99 (FlhB/HrpN/YscU/SpaS family) y4yP +1 518077-518892 271 35-262 HipA 295 M19019 88 91 homolog is inducible by root-exudate and diadzein; frameshift acc. to homologue: 518855 (1 > 2) fy1 + 519655-519995 NolJ 148 L26967 nodulation gene homologous fragments (80-100% id. in 97 aa); frameshifts acc. to homologue: 519789 (1 > 3); 519900 (3 > 2); 519965 (2 > 3) y4yQ +2 520280-521170 296 hyp. 31.3 kd integral membrane protein y4yR +2 521360-523453 697 17-677 LcrD 704 M96850 40 65 prob. translocation protein inv. secretion processes [Flagella/HR/Invasion proteins export pore (FHIPEP) family] y4yS +3 523470-524018 182 hyp. 20.1 kd protein y4zA +2 525005-525892 295 34-115 Y4rM 350 this work 98 98 hyp. (fragmentous?) 32.9 kd protein; put. frameshift: 133-231 Y4rL 155 this work 99 99 525699 (2 > 3); similar to Y4iE y4zB +1 526051-527121 356 60-320 Tnp 377 X67862 29 47 put. (fragmentous?) transposase (IS4 family) 526103- 526200 higher cod. prob. in fr. 2; put. frameshift: 526200 (2 > 1) fzl + 527337-527902 Hdc 378 J02577 fragments homologous to histidine decarboxylases (30- 45% id. in 134aa); put. frameshift (3 > 2) around 527478 y4zC +3 529125-529910 261 65-248 AvrPph3 276 M86401 27 41 hyp. 28.3 kd protein; hom. to avirulence protein y4zD +3 530145-530294 49 hyp. 5.5 kd protein fz4 +2 530432-530764 110 1-110 Y4jA 504 this work 72 85 hom. to C-terminus of Y4jA/Y4nE/Y4sE fz2 + 530761-531250 ORFB 251 X67861 put. IS-ATP-binding protein fragments (32-40% id. in 137aa); put. frameshift acc. to homolog: 531062 (1 > 2) y4zF syrM2 +2 532676-533695 339 1-320 SyrM 326 M33495 69 81 prob. symbiotic regulator (LysR family) 1-335 SyrM1 338 this work 62 79 fz3 + 534257-534422 ORF 338 M73488 fragments homologous to 1-aminocyclopropane-1- carboxylate deaminase (63-83% id. 
in 56aa); put. frameshift: 534291

[0059] role of some ORFs like the luciferase-like ORFs (y4vJ and y4wF; see Table 3) in rhizobia is still not clear. In the 100 kb region, the duplication of a 5 kb sequence (positions 451,886 to 456,157 and 483,764 to 488,035) that includes the genes nifHDK is remarkable. These genes encode the basic subunits of nitrogenase. Furthermore, the transcriptional regulator nodD2 is of particular interest because its role does not seem to be identical to that of a previously identified nodD2 in a closely related strain (Appelbaum et al., 1988; data not shown). The pmrA-homologous ORF y4xI also putatively plays an important role in regulating symbiotic processes, because a nod box (binding region for the basic regulator nodD1; Fisher and Long, 1993) is located upstream of this ORF (positions 493,962 to 494,000). Finally, the presence of ORFs (y4yI and y4yK to y4yN; see Table 3) homologous to type III secretion proteins, which were previously known only in bacteria pathogenic to plants, animals or humans, suggests that only a subtle difference separates the symbiotic and pathogenic abilities of microorganisms.
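
The coordinates quoted in paragraph [0059] can be checked with simple interval arithmetic. The following Python sketch assumes that all positions are 1-based and inclusive (an assumption; the text does not state the convention) and takes the start of y4xI (494,297) from Table 3. The function and variable names are illustrative only.

```python
# Sketch only: assumes 1-based, inclusive coordinates on pNGR234a.

def span_length(start, end):
    """Number of bases in an inclusive interval."""
    return end - start + 1

# The two copies of the duplicated nifHDK region (paragraph [0059]).
copy_1 = (451_886, 456_157)
copy_2 = (483_764, 488_035)

len_1 = span_length(*copy_1)
len_2 = span_length(*copy_2)
shift = copy_2[0] - copy_1[0]  # distance between the starts of the two copies

# nod box upstream of y4xI; the ORF start is taken from Table 3.
nod_box = (493_962, 494_000)
y4xI_start = 494_297
nod_box_length = span_length(*nod_box)
gap_to_orf = y4xI_start - nod_box[1] - 1  # bases between nod box end and ORF start

print(f"copy 1: {len_1} bp, copy 2: {len_2} bp, offset: {shift} bp")
print(f"nod box: {nod_box_length} bp, ending {gap_to_orf} bp before y4xI")
```

Under these assumptions the two copies have the same length (4,272 bp, which the text rounds to 5 kb), and the nod box ends a few hundred bases before the predicted start of y4xI.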

[0060] In a second stage, the remaining 436 kb of pNGR234a were analyzed. Several ORFs and their deduced proteins were identified that belong to functional groups not previously identified in the analysis of cosmids pXB296, pXB368 and pXB110 (replication of the plasmid, conjugal transfer of the plasmid, functions in oligosaccharide biosynthesis and cleavage, functions in sugar or sugar-derivative metabolism, functions in lipid or lipid-derivative metabolism, functions in chemoperception/chemotaxis, functions in biosynthesis of cofactors, prosthetic groups and carriers, etc.).

[0061] Although further functional analyses of selected ORFs in pNGR234a still have to be performed, large-scale sequencing gives a global picture of their genomic organization and possible roles. Determination of putative functions of predicted genes by homology searches and identification of sequence motifs (promoters, nod boxes, nifA activator sequences, and other regulatory elements) will aid in finding new symbiotic genes. High-fidelity sequence data covering long stretches of the genome are a prerequisite for these studies. The use of the dye terminator/thermostable sequenase shotgun approach has allowed the completion of the entire ~500 kb sequence of pNGR234a and has opened up new avenues for the genetic analysis of symbiotic function.

[0062] Genetic Organization of the Whole Plasmid pNGR234a

[0063] Within the complete nucleotide sequence of pNGR234a, which comprises 536,165 bp, a total of 416 ORFs were predicted to encode proteins. An additional 67 ORF-fragments were detected that seem to be remnants of functional ORFs.
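
As a consistency check on the predicted ORFs, the coordinates and protein lengths listed in Table 3 can be compared directly. The Python sketch below assumes that each coordinate pair is 1-based and inclusive and that the interval includes the stop codon; both are assumptions that happen to fit the four rows used here (y4oL, y4pC, y4uN and y4vK), not conventions stated in the text. The helper name `expected_residues` is hypothetical.

```python
# Sketch only: checks that ORF coordinates and protein lengths from
# Table 3 agree, assuming inclusive coordinates that include the stop codon.

def expected_residues(start, end):
    """Amino acid count implied by an inclusive ORF interval with stop codon."""
    n_bases = end - start + 1
    if n_bases % 3 != 0:
        raise ValueError(f"interval {start}-{end} is not a whole number of codons")
    return n_bases // 3 - 1  # subtract the stop codon

# (locus tag, start, end, amino acids listed in Table 3)
rows = [
    ("y4oL", 302_890, 303_156, 88),
    ("y4pC", 317_185, 318_579, 464),
    ("y4uN", 436_460, 438_244, 594),
    ("y4vK", 451_993, 452_883, 296),
]

results = {tag: expected_residues(start, end) for tag, start, end, _ in rows}
mismatches = [tag for tag, _, _, listed in rows if results[tag] != listed]

for tag, start, end, listed in rows:
    print(f"{tag}: {end - start + 1} nt -> {results[tag]} aa (table: {listed} aa)")
print("mismatches:", mismatches)
```

A row whose interval is not a multiple of three, or whose implied length differs from the listed one, would point to a transcription error in the coordinates or to a different coordinate convention.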

[0064] Thirty-four percent (139) of the 416 potential proteins have no obvious similarities to any known proteins. Of the remaining 277 proteins, 31 (8%) are similar to proteins for which no biochemical or phenotypic role has been assigned, 12 (3%) are similar to proteins for which limited biological data are available, and 234 (56%) are similar to proteins with a more precise biological function: enzymes (95), proteins involved in integration and recombination of insertion elements (44), transporters (32), transcriptional regulators (22), protein secretion/export (21), proteins involved in replication and control of the plasmid (12), electron transporters (6), and proteins involved in chemotaxis (2). A high proportion of enzymes was expected of a symbiotic replicon involved in nodulation (Nod-factor biosynthesis, etc.) and nitrogen fixation. As expected from the observation that NGR234 can be cured of its plasmid (Morrison et al., 1983), no ORFs essential to transcription, translation or to primary metabolism were found.
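
The counts in paragraph [0064] can be tallied to confirm that the categories add up. The Python sketch below uses only the numbers quoted in that paragraph; the variable names and category labels are illustrative. Percentages are computed relative to the 416 predicted proteins, which is an assumption about how the quoted percentages were derived.

```python
# Sketch only: tallies the protein categories quoted in paragraph [0064].

TOTAL_PROTEINS = 416

similarity_classes = {
    "no obvious similarity": 139,
    "similar, no assigned role": 31,
    "similar, limited biological data": 12,
    "similar, more precise function": 234,
}

functional_groups = {
    "enzymes": 95,
    "integration/recombination of insertion elements": 44,
    "transporters": 32,
    "transcriptional regulators": 22,
    "protein secretion/export": 21,
    "replication and control of the plasmid": 12,
    "electron transporters": 6,
    "chemotaxis": 2,
}

class_total = sum(similarity_classes.values())
group_total = sum(functional_groups.values())
with_similarity = TOTAL_PROTEINS - similarity_classes["no obvious similarity"]

for label, count in similarity_classes.items():
    share = 100 * count / TOTAL_PROTEINS
    print(f"{label}: {count} ({share:.1f}%)")
print(f"class total: {class_total}, functional group total: {group_total}")
```

The four similarity classes sum to the 416 predicted proteins and the eight functional groups sum to the 234 proteins with a more precise function. The rounded percentages quoted in the text may differ by a point from values computed this way.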

[0065] A large number of protein families are present in several copies on pNGR234a. This is true even after elimination of the many proteins which are encoded in repeated IS elements, or are involved in transposition, integration or recombination. The most notable examples of highly represented protein families include: five members of the short-chain dehydrogenase/reductase family, one of which (y4vI) contains two homologous domains; five complete and one partial ABC-type transporter operons that each encode at least one ABC-type permease and an ABC-type ATP-binding protein; four cytochrome P450s; and three members of peptidase family S9A. In total, 85 proteins belong to families that are represented more than once and which do not seem to be linked to insertion or recombination.

[0066] The majority (330, 79%) of the putative proteins are probably located in the cytoplasm of the bacterium, 62 (15%) possibly span membranes, 20 (5%) could be located in the periplasm, 3 are predicted to be lipoproteins that could associate with the outer membrane, and 2 are probably outer membrane proteins. These observations accord well with the dominance of biosynthetic proteins, as well as proteins involved in transcriptional regulation and insertion/recombination, most of which are thought to be cytoplasmic.

[0067] Although other start points cannot be excluded, replication of pNGR234a probably begins at oriV, which is located within the intergenic sequence (igs) between the repC- and repB-like genes y4cI and y4cJ. This locus (positions 54,417 to 54,570) encodes three proteins with 40-60% amino acid identities to RepABC of pTiB6S3 (a Ti-plasmid of Agrobacterium tumefaciens), pRiA4b (an Ri-plasmid of A. rhizogenes) and pRL8JI (a cryptic plasmid of R. leguminosarum bv. leguminosarum). Amongst replication regions, the highest identities (69 to 71% at the nucleotide level) are found in the intergenic sequences between repC and repB (FIG. 5). In Agrobacterium, these intergenic sequences are the determinants which render parental plasmids incompatible. Two ORFs (position 198,500), which are homologous to pseudomonal genes involved in plasmid stability, may also play a role in replication of pNGR234a. A 12 bp portion of the origin of transfer (oriT) is identical to that of pTiC58 of Agrobacterium tumefaciens (nt 80,162 to 80,173), and highly similar to those of RSF1010 (Escherichia coli) and pTF1 (Thiobacillus ferrooxidans). This sequence corresponds to the oriT of plasmids containing the “Q-type nick-region” (FIG. 6).

[0068] Another 24 predicted ORFs show homologies to conjugal transfer genes of Agrobacterium Ti-plasmids. All are located in two large clusters between positions 57,000 and 83,000. Since pNGR234a was believed to be non-transmissible (Broughton et al., 1987), the fact that both the nucleotide sequences of the individual ORFs and their order are similar in Agrobacterium and NGR234 came as a surprise. Conjugal transfer of Ti plasmids in A. tumefaciens is controlled by a family of N-acyl-L-homoserine lactone auto-inducers (Zhang et al., 1993). Similar molecules, which are able to interact with the traR gene product of A. tumefaciens, were detected in the supernatants of NGR234 cultures using the assay of Piper et al. (1993).

[0069] Reiterated sequences first became apparent in NGR234 during the construction of an ordered array of cosmid clones (Perret et al., 1991). It is now clear that 97 kbp (18%) of pNGR234a represents insertion (IS) and mosaic (MS) sequences (FIG. 7). Homology searches for known IS/MS revealed some of these, while comparison of repeated sequences within pNGR234a, as well as between the plasmid and 2,500 random chromosome sequences (V. Viprey, pers. communication), located the rest. Seventy-five putative ORFs (18% of the total) and 40 fragments of ORFs were identified this way, nearly half of which (44) show homologies to integrases and transposases. Many of these IS elements are similar not only to those derived from Rhizobium and Agrobacterium species, but also to those of other, diverse Gram (−) and Gram (+) bacteria (e.g. Bacillus, Escherichia, and Pseudomonas). The sheer number and diversity of these IS/MS elements suggest that NGR234 has functioned as a “transposon trap”. This is supported by the fact that their average G+C content (61.5%) is 3 percentage points higher than that of pNGR234a (58.5%). Interestingly, many IS/MS are clustered between positions 300,000 and 390,000 (FIG. 7), while some loci are almost unaffected by insertions (oriV, nod-, fix- and nif-ORFs). Small IS/MS clusters divide the replicon into large blocks of often functionally related ORFs (e.g. blocks of nod-ORFs, replication and conjugal transfer ORFs, nif-ORFs and fix-ORFs). A list of all sequences with IS-element or mosaic sequence character is given in Table 4. Although transposition of these IS/MS elements has not been demonstrated, transfer of plasmids amongst rhizobia in the legume rhizosphere (Broughton et

TABLE 4 Insertion/mosaic sequences in pNGR234a
put. ORFs/ ORF-fragments homologous start of region stop of region name of region included similarities within pNGR234a similarities to chromosome sequences in other organisms/comments
17000 17600 ISH-10b y4aQ 33% aa-id.
to y4hP (ISH-10a) geneproducts from IS866 and IS66 from Ag. tumefaciens 18900 19661 ISH-11b fa2 54% aa-id. to part of y4bF (ISH-11a); Tnp of IS1202 from Str. pneumoniae 19096-19362: 91% nt-id. to ISH-11c 19666 22981 NGRIS-4a y4bABCD identical to NGRIS-4b many copies on the chromosome 22985 25400 ISH-11a fb1, y4bF y4bF: sim. to fb1 and fa2 (ISH-11b) partially 91% nt-id. to chromosomal sequences Tnp of IS1202 from Str. pneumoniae 32463 35085 NGRIS-3a y4bLM identical to NGRIS-3b/c copie(s) on the chromosome 62% nt-id. (over 2352 nt) to IS1162 of Ps. fluorescens (IS21/IS1162/IS408 family) 49300 50300 ISH-13a y4cG similar to y41S (ISH-13b) DNA invertase 69936 70385 ISH-4c fd1 70233-70385: 93% nt-id. to part of ORFA of IS5376 from B. stearothermophilus NGRIS-4 93322 96025 ISH-12a fe2, y4eF, fe5, 93574-94927: 90% nt-id. to ISH-12b1; Tnp (fe2) and Int (Y4eF, fe3) from Weeksella fe3 75% nt-id. to fq6 region (ISH-12b2); zoohelcum-IS-element; (93322-94586: 57% nt-id. to 95343-95558: 88% nt-id. to ISH-12b3 IS292 from Ag. radiobacter); “phage” integrase family (Y4eF, fe5, fe3) 101939 102394 ISH-8b fe7 84% nt-id. to ISRm5 of R. meliloti; fe7: mutator family of transposases 115881 116004 MSH-14b partially homologous to ISH-14a 72-73% nt-id. to sequences downstream from mosaic element chvl/upstream from rpoN on the chromosome 124396 124500 MSH-14a partially homologous to ISH-14b 82% nt-id. to sequence RIME1 downstream from mosaic element chvI on the chromosome; parts of MSH-14a show 73-89% nt-id. to chromosomal sequences 126806 127369 ISH-12f y4gA low, similarity to y4rE 127900 128500 ISH-12e y4gC recombinase from pAE1 of Al. eutrophus (“phage” integrase family) 131000 131800 ISH-15 y4gE* partially 87% nt-id. to chromosomal sequences 159781 160564 ISH-16 96% nt-id. to repetitive sequence from R. fredii USDA257 (acc. no. M73698) 164600 167700 ISH-10a y4hNOPQ 99% nt-id. of parts of y4hPQ to chromosomal different ORFs derived from IS-like sequences; sequences partially known as acc. 
no. X74068 (“Region2” from pNGR234a); 164853-167086; 66% nt-id. to IS66 from Ag. tumefaciens 168208 169190 ISH-2c fi1, fi2, fi3 168343-168659: 72% nt-id. to 168208-168383: 70 nt-id. to ISRm2011-2 ISH-2f1/ISH-2d1 (R. meliloti); fi2/3: IS1111A, IS1328, IS1533 168785-169091: 73% nt-id. to ISH- family of transposases 2f2/ISH-2d2 173295 173702 ISH-8g y4iE* y4iE: sim. to y4rL, y4zA. and fr2 175590 175909 ISH-11c y4iG* 175643-175909: 91% nt-id. to ISH-11a 185672 186507 ISH-2d y4iO*/P* (3′- 185672-186075(−): 73% nt-id. to ISH- Y4iO: Tnp of IS1328 from Y. enterocolitica end) 2c2(+) (IS1111A, IS1328, IS1533 family) 186208-186507(−): 72% nt-id. to ISH- 2c1(+) 187112 189752 NGRIS-5a y4iQjA identical to NGRIS-5b/c copie(s) on the chromosome IstA and B (Tnps) of IS1326 from E coli (IS21/IS1162/IS408 family) 190000 193500 ISH-10c y4jBCD(E*) 38/32 aa-id. of y4jCD to y4hOP different ORFs derived from IS-like sequences; (ISH10a) partially 60% nt-id. to IS866 (Ag. tumefaciens); IS292 (Ag. radiobacter); ISR11 (R. leguminosarum) 193518 193634 MSH-17 76% nt-id. to repetitive sequence RMX6 from Myxococcus xanthus (acc. no. M60865) 199746 199958 ISH-11d y4jM* similarity to fb1 and y4bF (ISH-11a) 211165 211265 ISH-10g 74% nt-id. to ISR11 (R. leguminosarum), IS66/IS866 derivative 212350 212580 ISH-10h fk2 similar to y4jD (ISH-10c) 74% nt-id. to IS66 217564 220186 NGRIS-3b y4kIJ identical to NGRIS-3a/c copie(s) on the chromosome 62% nt-id. (over 2352 nt) to IS1162 of Ps. fluorescens (IS21/IS1162/IS408 family) 224547 224995 ISH-18a y4kQ 83% nt-id. to ISH18b (427651-428102) IS110 family (3′-end) 240800 241040 ISH-24b fl2 60% nt-id. to ISR12 from R. leguminosarum 244540 244851 ISH-19a fl4 244620-244812: 97% nt-id. to ISH-19b TnpA from Tn163 (R. leguminosarum) 248290 248655 ISH-19b fl5 248463-248655: 97% nt-id. to ISH-19a 248814 249680 ISH-20 fl6 Tnp of Tn1546 (Enterococcus faecium; Tn21/501/1721 family) 251407 252400 ISH-13b y41SmA y41S: similar to y4cG (ISH-13a) y41S: invertase; 58% nt-id. 
(251409-252211) to Tn501 from Ps. aeruginosa (acc. no. Z00027) 258551 258657 MSH-21 mosaic sequence: 82% nt-id. to sequence upstream of ropA2 (R. liguminosarum; acc. no. X80794) 280403 283043 NGRIS-5b y4nDE identical to NGRIS-5b/c copie(s) on the chromosome IstA and B (Tnps) of IS1326 from E. coli (IS21/IS1162/IS408 family) 284722 284985 ISH-1b fn2 60% nt-id. to IS1162 (Ps. fluorescens, IS21/IS1162/IS408 family) 300017 300819 ISH-1c fo3 61% nt-id. IS408 (Ps. cepacia; IS21/IS1162/IS408 family) 300820 304117 NGRIS-6 fo4/5/6, 77% nt-id. to NGRIS-4 y4oL/M/N 304118 304434 ISH-1d fo7 61% nt-id. IS408 (Ps. cepacia; IS21/IS1162/IS408 family) 318854 319686 NGRIS-7 fp1-2 66% nt-id. to IS1248 of Pa. denitrificans 320456 328935 NGRRS-1a fp3/4; y4pL 3 copies on the chromosome interrupted by NGRIS2a and 4b; fp3/4, Y4pL: IS21/IS1162/IS408 family 320590 323147 NGRIS-2a y4pEFG identical to NGRIS-2b partially 88-990% nt-id. to repetitive sequence RFRS9 of R. fredii USDA257 (IS1111A/IS1328/ IS1533 family) 323961 327276 NGRIS-4b y4pHIJK identical to NGRIS-4a many copies on the chromosome (disrupts all 4 copies of NGRRS-1) 335004 336301 NGRIS-8 y4pO similar to fe7 (ISH-8b) 88% nt-id. to ISRm3 of R. meliloti: mutator family of transposases 342272 342419 ISH-12d 342272-342419: 87% nt-id. to ISH- 12b4 344100 345300 ISH-2e y4qE Tnp (Leptospira borgpetersenii): IS1111A/IS1328/IS1533 family 345755 346133 ISH-12c fq4 345755-346133: 82% nt-id. to ISH- Int (XerC, E. coli): “phage” integrase family 12b5 351600 351735 MSH-22 80 nt-id. to sequence from pTiS4 (Ag. vitis; acc. no. M91609) 351826 353794 ISH-10d y4qI, fq5 fq5: 35% aa-id. to y4hQ (ISH-10a) 71-95% nt.-id. of parts of y4qI to chromosomal 67% nt-id. to ISR11 (R. leguminosarum; acc. no. sequences L19650); IS866/66 homolog 354000 363073 ISH-12b y4qJK, fq6, 354942-356123/356215-356383: 70% nt-id. of parts of ISH14a1 to chromosomal Tnp and Int from Weeksella zoohelcum -IS-element y4rABCDEF 90/91% nt-id. 
to ISH-12a1; sequences (y4qJK), different integrases (y4rAB), integrase 75% nt-id. to fe5 region (ISH-12a2): XerC of H. influenzae (y4rC); 359753-359968. 88% nt-id. to ISH- y4qK, fq6, y4rABCDEF: “phage” integrase family 12a3; 361029-361410: 82% nt-id. to ISH-12c 362507-362654: 87% nt-id. to ISH-12d 363287 363694 ISH-10i y4rG low similarity to y4jB and fr4 (ISH- unknown protein from IS1312 (A. tumefaciens)/ 10c/i) IS866 366252 367402 ISH-8f fr1, fr2 366252-366524: 88% nt-id. to ISH-8e 366773-366953: 92% nt-id. to ISH-8g 367699 367970 ISH-10c fr3 56% aa-id. of fr3 to y4hO (ISH-10a) 75% nt-id. to IS66 (Ag. tumefaciens) 368503 369675 ISH-23 y4rI 91-93% nt-id. of parts of y4rI to chromosomal sequences 369697 370887 ISH-2f y4rJ 370012-370328: 72% nt-id. to ISH-2c1 y4rJ: Tnp from IS1111a of Coxiella burnitii 370479-370785: 73% nt-id. to ISH-2c2 (IS1111A/IS1328/IS1533 family) 371399 372990 ISH-8e y4rL*M* 371399-371671: 88% nt-id. to ISH-8f 371474-372228: 97% nt-id. to ISH-8d 377185 377695 ISH-10j fr4 similar to y4rG (ISH-10i) 377327-377695: 75% nt-id. to ISRm6 (R. meliloti) 377826 380383 NGRIS-2b y4sABC identical to NGRIS-2a partially 88-90% nt-id. to repetitive sequence RFRS9 of R. fredii USDA257 (IS1111A/IS1328/ IS1533 family) 380883 383523 NGRIS-5c y4sDE identical to NGRIS-5a/b copie(s) on the chromosome IstA and B (Tnps) of IS1326 from E. coli (IS21/IS1162/IS408 family) 383593 384054 ISH-2g fs5 Tnp of IS1328 of Y. enterocolitica (IS1111A/IS1328/IS1533 family) 384210 384493 ISH-10k fragments with 94-84% nt-id. to ISRm6 (R. meliloti) 388100 388600 ISH-2h fs1 different Tnps (IS1111A/IS1328/IS1533 family) 388601 388900 ISH-10l fs2 ORF from IS1312 of Ag. tumefaciens (IS66/866 family) 396445 397301 NGRIS-9 y4sN and fs4 91-99% nt-id. of NGRIS9-parts to chromosomal different ORFs derived from IS elements; partially sequences known from acc. no. X74314 400626 403248 NGRIS-3c y4tAB identical to NGRIS-3a/b copie(s) on the chromosome 62% nt-id. (over 2352 nt) to IS1162 of Ps. 
fluorescens (IS21/IS1162/IS408 family) 426525 428102 ISH-18b y4uE* 427651-428102: 83% nt-id. to ISH-18a 77-96% nt-id. of ISH-18b-parts to chromosomal Tnp of mini-circle DNA from Str. coelicolor sequences (IS110 family) 429860 430007 ISH-8c fu3 85% nt-id. to ISRm5 (R. meliloti) 430568 432851 ISH-1e y4uHI 60% nt-id. to IS408/IS1162 (Ps. cepacia/Ps. fluorescens) 433222 433560 ISH-24a fu4 low similarity to y4sN (NGRIS-9) 79% nt-id. to ISRm4 (R. meliloti)/ISR12-like 462554 463053 ISH-10f fragments with 83-69% nt-id. to IS866 (Ag. tumefaciens) 524946 525892 ISH-8d y4zA 525095-525849: 97% ni-id. to ISH-8e 524946-525580: 61% nt-id. to ISRm5 (R. meliloti) 526051 527121 ISH-25 y4zB* Tnp of IS5376 from B. stearothermophilus (IS4 family of transposases) 530364 531249 ISH-1f fz4, fz2 79% nt-id. to part of NGRIS-5 fz4/2: IS21/IS1162/IS408 family

[0070] al., 1987) and to other non-symbiotic bacteria in fields (Sullivan et al., 1995) suggests that lateral transfer of genetic information has helped shape symbiotic potential.

[0071] Carbohydrates are constituents of the rhizobial cell wall as well as morphogens called Nod-factors (short tri- to penta-mers of N-acetyl-D-glucosamine, substituted at the non-reducing terminus with C16 to C18 saturated or partially unsaturated fatty acids). Elements of the biosynthetic pathways leading to cell walls or to lipo-chito-oligosaccharides (Nod-factors) are common. Most differences are found in the later stages of the pathways that lead to specific cell-wall components or to Nod-factors.

[0072] As befits a symbiotic replicon, only 13 ORFs with homology to polysaccharide synthesis genes (house-keeping genes sensu stricto) are located on the plasmid (Table 3). Sequences homologous to exoB, exoF, exoK, exoL, exoP, exoU, and exoX (X. Perret and V. Viprey, unpublished), and exoY (Gray et al., 1990) are clearly located on the chromosome. Although loci with weak homologies to nod-box:-psiB of R. leguminosarum, and exoX of R. meliloti exist on the plasmid (y4iR, and y4xQ respectively), these are regulatory rather than structural genes, suggesting that almost all cell wall polysaccharide synthesis ORFs are chromosomally located.

[0073] Except for nodPQ and nodE, at least one copy of all the regulatory and structural ORFs involved in Nod-factor biosynthesis seems to be located on the plasmid. The activity of most nodulation genes is modulated by four transcriptional regulators of the lysR family. These are nodD1 (y4aL), syrM1 (y4pN), nodD2 (y4xH), and syrM2 (y4zF). NodC, which is an N-acetylglucosaminyltransferase, the first committed enzyme in the Nod-factor biosynthetic pathway, is part of an operon which includes nodABCIJnolOnoeIE (y4hI to y4hB, Table 3). Together, these genes, which form the hsnIII locus, are responsible for the synthesis of the core Nod-factor molecule, and the adjunction of 3- (or 4)-O-carbamoyl, 2-O-methyl, and 4-O-sulfate groups (Hanin et al., unpublished). nodZ (y4aH), which encodes a fucosyltransferase, is part of the hsnI locus, which includes noeJ (y4aJ), noeK (y4aI), noeL (y4aG), nolK (y4aF), all of which are involved in the fucosylation of NodNGR factors (Fellay et al., 1995a). Wild-type NodNGR factors are also N-methylated and 6-O-carbamoylated, adjuncts which are added by the transferases encoded by nodS and nodU respectively [y4nC and y4nB; hsnII (Lewin et al., 1990)]. Possibly the only other enzyme which may be directly involved in Nod-factor biosynthesis is that encoded by nolL (y4eH, Table 3). As the 2-O-methylfucose residue of NGR234 Nod-factors is either 3-O-acetylated or 4-O-sulphated, an acetyltransferase is obviously required. Since NolL shows only limited homology to acetyltransferases, however, experimental proof of the transferase activity will be required.

[0074] In contrast to R. leguminosarum and R. meliloti harbouring pNGR234a, A. tumefaciens (pNGR234a) transconjugants are incapable of nitrogen fixation (Broughton et al., 1984), suggesting that some essential fix-ORFs are also carried by the chromosome. Nevertheless, more than 40 nif- and fix-ORFs are plasmid borne. Included amongst these is nifA (y4uN), which encodes a sigma-54 dependent regulator. Mutation of rpoN (which encodes sigma-54) causes a Fix⁻ phenotype on NGR234 hosts (van Slooten et al., 1990). Similarly, mutation of fixF (y4gN) disrupts synthesis of a rhamnose-rich extra-cellular polysaccharide, and results in a Fix⁻ phenotype on Vigna unguiculata, the reference host for NGR234 (unpublished). In fact, loci adjacent to fixF are probably responsible for the synthesis of dTDP-rhamnose from glucose-1-phosphate. Enzymes involved in this biosynthetic pathway include glucose-1-phosphate thymidylyltransferase (y4gH), dTDP-glucose-4,6-dehydratase (y4gF), dTDP-4-dehydrorhamnose-3,5-epimerase (y4gL), and dTDP-4-dehydrorhamnose reductase (y4gG). Rhamnose-rich lipopolysaccharides (LPS) seem to be necessary for complete bacteroid development and nitrogen fixation (Krishnan et al., 1995). Perhaps the enzyme encoded by y4gI is needed for the synthesis of the rhamnose-rich LPS from dTDP-rhamnose.

[0075] Although not directly involved in the fixation process, mutation of the plasmid borne copy of dctA (=dctA1, y4vF) also impairs nitrogen fixation (van Slooten et al., 1992). Other nif- and fix-ORFs are involved in elaboration of the electron-transfer complex (fixAB), in various cofactors required for nitrogen fixation (e.g. fixC, nifB, nifE, nifN, etc.), and in the synthesis of ferredoxins (fdxB, fdxN, fixX). Finally, those ORFs involved in the synthesis of the nitrogenase complex are also present. Amongst these are two functional copies of the nifKDH ORFs (y4vM to y4vK and y4xC to y4xA) (Badenoch-Jones et al., 1989). Additionally, 17 new ORFs located within the nitrogen fixation cluster (see FIG. 7; ORFs y4vC to y4vJ with the exception of dctA1, y4wA to y4wG, y4wI, y4wJ and y4xQ) are co-transcribed together with the ORFs homologous to known nif and fix genes. It thus seems likely that most ORFs necessary for bacteroid development and synthesis of the nitrogen-fixing complex are carried by pNGR234a.

[0076] Two types of regulatory elements which frequently occur in pNGR234a are the NodD- and NifA/sigma-54-dependent promoters. NodD-dependent promoter-like sequences known as nod boxes have been identified by homology search within intergenic regions, using the following consensus sequence: 5′-YATCCAYNNYRYRGATGNNNNYNATCNAAACAATCRATTTTACCAATCY-3′ [12 mismatches allowed (van Rhijn and Vanderleyden, 1993); Y=C or T, R=A or G, N=A, C, G or T]. Putative NifA-dependent promoters (Fischer, 1994) have been predicted by screening for the NifA activator sequence (5′-TGT-N₁₀-ACA-3′) together with the sigma-54 promoter consensus sequence (5′-TGGCAC-N₅-TTGCA/T-3′ with GG and GC as the most conserved doublets; 3 mismatches allowed) separated by 60 to 150 nucleotides. The identified conserved promoter-like sequences in pNGR234a are listed in Tables 5 and 6.

TABLE 5 nod box-like sequences in pNGR234a

| Number of the box | Position in pNGR234a | Orientation | Number of mismatches to the nod consensus sequence | Distance to the following ORF | Name of following ORF |
|---|---|---|---|---|---|
| 1 | 4514-4562 | − | 11 | 504 | (fa1) |
| 2 | 8481-8529 | − | 8 | 87 | nodZ |
| 3 | 12322-12370 | − | 7 | - | ?# |
| 4 | 97470-97518 | − | 6 | 277 | nolL |
| 5 | 129615-129663 | + | 10 | 1358 | y4gE |
| 6 | 141088-141136 | + | 8 | 890 | fixF |
| 7 | 150280-150327 | − | 11 | 202 | noeE |
| 8 | 158820-158868 | − | 4 | 235 | nodA |
| 9 | 161891-161939 | + | 11 | 1103 | y4hM |
| 10 | 169833-169881 | − | 7 | 117 | y4iR |
| 11 | 278947-278995 | − | 7 | 153 | nodS |
| 12 | 279821-279869 | + | 7 | - | ?# |
| 13 | 443101-443149 | − | 10 | 465 | y4vC |
| 14 | 473059-473107 | + | 9 | 236 | y4wH |
| 15° | 481253-481301 | − | 16 | 117 | y4wM |
| 16 | 493961-494009 | + | 6 | 288 | y4xI |
| 17 | 532039-532087 | + | 5 | 589 | syrM2 |
| 18 | 256434-256482 | + | 12 | 329 | y4mC |
| 19 | 469151-469199 | + | 12 | 112 | y4wE |
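By way of illustration only, the nod box screen described above can be sketched in Python as a sliding-window comparison against the 49-nt degenerate consensus. The function names, the brute-force scan of both strands, and the 1-based coordinate convention are assumptions made for this sketch; the source does not state which search program was actually used.

```python
# Illustrative sketch only: names and scanning strategy are assumptions,
# not the search tool used in the original analysis.

# IUPAC codes appearing in the consensus (Y, R, N) plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "N": "ACGT",
}

# 49-nt nod box consensus as given in the text.
NOD_BOX = "YATCCAYNNYRYRGATGNNNNYNATCNAAACAATCRATTTTACCAATCY"

COMPLEMENT = str.maketrans("ACGTYRN", "TGCARYN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(window, consensus=NOD_BOX):
    """Count positions where a base is not permitted by the consensus code."""
    return sum(base not in IUPAC[code] for base, code in zip(window, consensus))


def find_nod_boxes(seq, max_mismatches=12, consensus=NOD_BOX):
    """Scan both strands; return (start, end, strand, mismatches), 1-based."""
    seq = seq.upper()
    k = len(consensus)
    hits = []
    for strand, target in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(target) - k + 1):
            mm = count_mismatches(target[i:i + k], consensus)
            if mm <= max_mismatches:
                # Map minus-strand windows back to plus-strand coordinates.
                start = i + 1 if strand == "+" else len(seq) - i - k + 1
                hits.append((start, start + k - 1, strand, mm))
    return sorted(hits)
```

In practice the scan would be restricted to intergenic regions, as stated above; the sketch scans whatever sequence it is given.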

[0077] TABLE 6 Putative NifA-dependent promoters in pNGR234a

| Nr. | NifA-dep. UAS*: position | sigma-54 promoter (−12/−24 region#): position | Orientation | Distance to the following ORF (nt) | Name of the following ORF |
|---|---|---|---|---|---|
| 1 | 90812-90827 | 90910-90924 | + | 127 | y4eD |
| 2 | 162727-162742 | 162788-162802 | + | 240 | y4hM |
| 3 | 235036-235051 | 234934-234948 | − | 66 | y4lD |
| 4 | 255021-255036 | 255130-255144 | + | 306 | y4mB |
| 5 | 285265-285280 | 285343-285357 | + | 50 | y4nG |
| 6 | 436363-436378 | 436275-436289 | − | 41 | nifB |
| 7 | 442046-442061 | 441955-441969 | − | 56 | fixA |
| 8 | 442735-442750 | 442676-442690 | − | 40 | y4vC |
| 9 | 444109-444124 | 443983-443997 | − | 104 | y4vD |
| 10 | 444137-444152 | 444241-444299° | + | 38° | nifQ |
| 11 | 451782-451799 | 451891-451905 | + | 88 | nifH1 |
| 12 | 460319-460334 | 460424-460438 | + | 63 | y4vR |
| 13 | 463063-463078 | 463139-463153 | + | 48 | y4wA |
| 14 | 478839-478854 | 478761-478775 | − | 463 | nifS |
| 15° | 483663-483678 | 483769-483783 | + | 88 | nifH2 |
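The pairing of a NifA upstream activator sequence with a sigma-54 promoter, as described in paragraph [0076], may be sketched as follows. This is a minimal illustration under stated assumptions: only the forward strand is scanned, the GG and GC doublets are required to match exactly, and the 60 to 150 nucleotide separation is measured from the start of the UAS to the start of the promoter. The source does not specify how the separation was measured, and all names are hypothetical.

```python
import re

# Illustrative sketch only: strand handling, doublet enforcement and the
# start-to-start distance convention are assumptions, not the source's method.

# NifA upstream activator sequence TGT-N10-ACA; lookahead finds overlaps.
UAS_RE = re.compile(r"(?=(TGT[ACGT]{10}ACA))")

# sigma-54 consensus TGGCAC-N5-TTGCA/T, with W standing for A or T.
SIGMA54 = "TGGCACNNNNNTTGCW"
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "N": "ACGT"}
CONSERVED = (1, 2, 13, 14)  # indices of the GG and GC doublets


def sigma54_sites(seq, max_mismatches=3):
    """Return (0-based start, mismatches) of sigma-54 promoter-like sites."""
    k = len(SIGMA54)
    sites = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if any(window[j] != SIGMA54[j] for j in CONSERVED):
            continue
        mm = sum(b not in IUPAC[c] for b, c in zip(window, SIGMA54))
        if mm <= max_mismatches:
            sites.append((i, mm))
    return sites


def nifa_promoters(seq, min_dist=60, max_dist=150):
    """Pair each promoter site with UAS elements lying upstream of it.

    Returns (UAS start, promoter start, mismatches) with 1-based starts.
    """
    seq = seq.upper()
    uas_starts = [m.start() for m in UAS_RE.finditer(seq)]
    results = []
    for p_start, mm in sigma54_sites(seq):
        for u in uas_starts:
            if min_dist <= p_start - u <= max_dist:
                results.append((u + 1, p_start + 1, mm))
    return results
```

Minus-strand candidates could be found by applying the same functions to the reverse complement and converting the coordinates back.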

EXAMPLES

Example 1

GENERAL METHODS

[0078] Bacteria and Plasmids

[0079] Escherichia coli was grown on SOC, in TB or in two-fold YT medium (Sambrook et al., 1989). The cosmid clones pXB296 and pXB110 (Perret et al., 1991) were raised in E. coli strain 1046 (Cami and Kourilsky, 1978). Subclones in M13mp18 vectors (Yanisch-Perron et al., 1985) were grown in E. coli strain DH5αF′IQ (Hanahan, 1983).

[0080] Construction of Cosmid Libraries

[0081] Cosmid DNA was prepared by standard alkaline lysis procedures followed by purification in CsCl gradients (Radloff et al., 1967). DNA fragments sheared by sonication of 10 μg of cosmid DNA were treated for 10 min at 30° C. with 30 units of mung bean nuclease (New England Biolabs, Beverly, Mass., USA), extracted with phenol/chloroform (1:1), and precipitated with ethanol. DNA fragments, ranging in size from 1 to 1.4 kbp, were purified from agarose gels using Geneclean II (Bio101, Vista, Calif., USA) and ligated into SmaI-digested M13mp18. Electroporation of aliquots of the ligation reaction into competent E. coli DH5αF′IQ was performed according to standard protocols (Dower et al., 1988; Sambrook et al., 1989).

[0082] M13 Template Preparation

[0083] Fresh 1 ml E. coli cultures in twofold YT held in 96-deep-well microtiter plates (Beckman Instruments, Fullerton, Calif., USA) were infected with recombinant phages from white plaques grown on plates containing X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactoside) and IPTG (isopropyl-β-D-thiogalactopyranoside). Rapid preparation of ~0.5 μg of single-stranded M13 template DNA was carried out as follows: 190 μl portions of the phage cultures grown for 6 hr at 37° C. were transferred into 96-well microtiter plates. Lysis of the phages was obtained by adding 10 μl of 15% (w/v) SDS followed by 5 min incubation at 80° C. Template DNA was trapped using 10 μl (1 mg) of paramagnetic beads (Streptavidin MagneSphere Paramagnetic Particles Plus M13 Oligo, Promega, Madison, Wis., USA) and 50 μl of hybridization solution [2.5 M NaCl, 20% (w/v) polyethylene glycol (PEG-8000)] during an annealing step of 20 min at 45° C. Beads were pelleted by placing the microtiter plates on appropriate magnets and were washed three times with 100 μl of 0.1-fold SSC. The DNA was recovered in 20 μl of water by a denaturation step of 3 min at 80° C. When required, larger amounts of single-stranded recombinant DNA (>10 μg) were purified using QIAprep 8 M13 Purification Kits (Qiagen, Hilden, Germany) from 3 ml of supernatant of phage cultures grown for 6 hr at 37° C.

[0084] Sequencing

[0085] Two sequencing methods were used: dye terminator and dye primer cycle sequencing, each in combination with AmpliTaq DNA polymerase (Perkin-Elmer) and Thermo Sequenase (Amersham). All reactions, including ethanol precipitation, were performed in microtiter plates. Reagents were pipetted using 12-channel pipettes. Where necessary, sequencing reaction mixtures, including enzymes, were pipetted into the plates in advance and held at -20° C. until needed.

[0086] Dye Terminator Cycle Sequencing

[0087] For dye terminator/AmpliTaq DNA polymerase sequencing, 0.5 μg of template DNA and the PRISM Ready Reaction DyeDeoxy Terminator Cycle Sequencing Kit (Perkin-Elmer) were used. Cycle sequencing was performed in microtiter plates using 25 PCR cycles (30 sec at 95° C., 30 sec at 50° C., and 4 min at 60° C.). Prior to loading the amplified products on electrophoresis gels, unreacted dye terminators were removed using Sephadex columns scaled down to microtiter plates (Rosenthal and Charnock-Jones, 1993).

[0088] Dye terminator/Thermo Sequenase sequencing was performed using the same experimental conditions except that the reaction mix contained 16.25 mM Tris-HCl (pH 9.5), 4.0 mM MgCl₂, 0.02% (v/v) NP-40, 0.02% (v/v) Tween 20, 42 μM 2-mercaptoethanol, 100 μM dATP/dCTP/dTTP, 300 μM dITP, 0.017 μM A/0.137 μM C/0.009 μM G/0.183 μM T from Taq Dye Terminators (Perkin-Elmer; no. A5F034), 0.67 μM primer, 0.2-0.5 μg of template DNA, and 10 units of Thermo Sequenase (Amersham) in a 30 μl reaction volume. Unincorporated dye terminators were removed from reaction mixtures by precipitation with ethanol.

[0089] Dye Primer Cycle Sequencing

[0090] Dye primer/AmpliTaq DNA polymerase sequencing reactions were performed according to the instructions accompanying the Taq Dye Primer, 21M13 Kit (Perkin-Elmer). Cycle sequencing was carried out on 0.5 μg of template DNA with 19 PCR cycles (30 sec at 95° C., 30 sec at 50° C., and 90 sec at 72° C.) followed by six cycles, each consisting of 95° C. for 30 sec and 72° C. for 2.5 min. Prior to electrophoresis, the four base-specific reactions were pooled and precipitated with ethanol.

[0091] Identical PCR conditions and the Thermo Sequenase Fluorescent Labelled Primer Cycle Sequencing Kit (Amersham) were used for dye primer/Thermo Sequenase sequencing reactions.

[0092] Sequence Acquisition and Analysis

[0093] Gel electrophoresis and automatic data collection were performed with ABI 373A DNA sequencers (Perkin-Elmer). After removing cosmid vector and M13mp18 sequences from the shotgun sequence data, the data were assembled using the program XGAP (Dear and Staden, 1991) and edited against the fluorescent traces. To close remaining gaps, to make single-stranded regions double-stranded, and to clarify ambiguities, additional cycle sequencing reactions with selected shotgun templates were carried out using either custom-made primers (primer-walks) or universal primer.

[0094] The complete double-stranded DNA sequence of cosmid pXB296 was analyzed using programs from the Wisconsin Sequence Analysis Package (version 8, Genetics Computer Group, Madison, Wis., USA). Homology searches were performed with BLAST (version 1.4; Altschul et al., 1990) and FASTA (version 2.0; Pearson and Lipman, 1988). Several nucleotide and protein databases were screened (GenBank/Genpept, SwissProt, EMBL, and PIR). Identities and similarities between homologous amino acid sequences were calculated with the alignment program BESTFIT (Smith and Waterman, 1981).
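For orientation, percent identity and percent similarity of the kind reported for homologous amino acid sequences can be computed from a pairwise alignment as sketched below. This is not the BESTFIT implementation: the similarity groups are a hypothetical stand-in for a scoring matrix, and excluding gapped columns from the denominator is an assumption of this sketch.

```python
# Illustrative sketch only. SIMILAR_GROUPS is a hypothetical simplification
# of a substitution matrix; BESTFIT's own scoring is not reproduced here.
SIMILAR_GROUPS = ("ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG")


def identity_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the ungapped alignment columns.

    Both inputs must be the two rows of one alignment, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped columns do not count
        aligned += 1
        if x == y:
            identical += 1
            similar += 1
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if aligned == 0:
        return 0.0, 0.0
    return 100.0 * identical / aligned, 100.0 * similar / aligned
```

For example, the rows `MKV-LA` and `MRVALA` share five ungapped columns, four of them identical and the fifth (K/R) conservative under the assumed groups.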

Example 2

[0095] Comparison of Fluorescent Traces Created by Different Cycle Sequencing Methods

[0096] When using a thermostable sequenase [Thermo Sequenase (Amersham)], the concentrations of dye terminators (Perkin-Elmer) can be reduced by 20- to 250-fold in comparison to the concentrations needed for Taq DNA polymerase without compromising the quality of the sequencing results (Table 7).

[0097] To compare the dye terminator and dye primer cycle sequencing procedures, representative templates derived from the pXB296 library were sequenced by both methods, each performed with Thermo Sequenase and Taq DNA polymerase (FIG. 1).

TABLE 7 Concentrations (in μM) of dye terminators in each cycle sequencing reaction with two different thermostable DNA polymerases

| Dye terminator | AmpliTaq DNA polymerase | Thermo Sequenase DNA polymerase | Dilution factor for dye terminators^(a) |
|---|---|---|---|
| A Taq | 0.751 | 0.017 | 40 |
| C Taq | 22.500 | 0.137 | 160 |
| G Taq | 0.200 | 0.009 | 20 |
| T Taq | 45.000 | 0.183 | 250 |

[0098] In general, dye terminator traces do not contain the many compressions (on average, one compression every 50 bases in single reads) that are common with dye primers if mixes do not contain nucleotide analogues like deoxyinosine or 7-deaza-deoxyguanosine triphosphates or if sequencers are used without active heating systems. In addition, dye terminator traces obtained with Thermo Sequenase show more uniform signal intensities than those obtained with Taq DNA polymerase, thus resulting in a reduced number of weak and missing peaks (e.g. a weak G-signal following an A-signal in Thermo Sequenase traces or a weak C-signal following a G-signal in Taq DNA polymerase traces). Using ABI 373A sequencers, errors in automatic base-calling of Thermo Sequenase/dye terminator scans only arise after 300-350 bases. The average number of resolved bases in dye primer gels (378 bases) is 46 bases longer than in those produced with dye terminators (332 bases). Furthermore, in Thermo Sequenase/dye primer sequences the peaks are very regular and the number of stops and missing bases decreases in comparison to Taq DNA polymerase/dye primer electropherograms. The number of compressions, however, is not significantly reduced.

Example 3

[0099] Shotgun Sequencing of Entire Cosmids Using Dye Terminators or Dye Primers

[0100] To compare the efficiency of both methods, cosmid pXB296 of pNGR234a was shotgun sequenced using a combination of dye terminators and thermostable sequenase (Thermo Sequenase), whereas another cosmid, pXB110, was sequenced using a combination of dye primers and Taq DNA polymerase (Table 1). Over 93% (736 clones) of 786 dye terminator reads of pXB296 were accepted by XGAP with a maximal alignment mismatch of 4%. By increasing this level to 25%, so that most of the remaining data could be included in the assembly, 775 reads led to three 6 to 10 kbp stretches of contiguous sequence (contigs), two of which were joined after editing. To close the last gap and to complete single-stranded regions with data derived from the opposite strand, only 32 additional dye terminator reads using custom-made primers were required. It took <1 week to assemble and finalize the 34,010 bp DNA sequence of pXB296 (EMBL accession no. Z68203; eight-fold redundancy; GC content, 58.5 mol %).

[0101] In contrast, only 308 (34%) of 899 shotgun reads obtained by Taq DNA polymerase/dye primer cycle sequencing of pXB110 were included in the first assembly (4% alignment mismatch). At the 25% alignment mismatch level, 879 reads were assembled, leading to 25 short contigs (1-2 kbp). These contigs had to be edited extensively in order to join most of them. “Primer walks”, covering gaps and complementing single-stranded regions, were not sufficient to clarify all the remaining ambiguities in the assembled sequence. Every 100-150 bp, a compression in one strand could not be resolved by sequence data from the complementary strand. Therefore, it was necessary to resequence clones using dye terminators and universal primer. In total, 191 additional dye terminator reads had to be created. As a result, assembling and finalizing the 34,573 bp sequence of pXB110 (10.5-fold redundancy; GC content, 58.3 mol %) took much more time than pXB296 did.
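The read counts quoted for the two cosmids can be put side by side with a short calculation. The sketch below merely restates the figures given in paragraphs [0100] and [0101]; the dictionary layout and function names are illustrative assumptions.

```python
# Figures taken from the text above; layout and names are illustrative.
COSMIDS = {
    "pXB296": {"accepted": 736, "shotgun": 786, "extra": 32, "length_bp": 34010},
    "pXB110": {"accepted": 308, "shotgun": 899, "extra": 191, "length_bp": 34573},
}


def acceptance_rate(accepted, total):
    """Percentage of shotgun reads accepted at the 4% mismatch level."""
    return 100.0 * accepted / total


def extra_reads_per_kb(extra, length_bp):
    """Additional finishing reads needed per kilobase of final sequence."""
    return extra / (length_bp / 1000.0)


def summarize(cosmids=COSMIDS):
    """Return {name: (acceptance %, extra reads per kb)} rounded for display."""
    return {
        name: (
            round(acceptance_rate(c["accepted"], c["shotgun"]), 1),
            round(extra_reads_per_kb(c["extra"], c["length_bp"]), 2),
        )
        for name, c in cosmids.items()
    }
```

On these figures, the dye terminator/Thermo Sequenase cosmid needed roughly one finishing read per kilobase, whereas the dye primer/Taq cosmid needed between five and six.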

Example 4

[0102] Analysis of Cosmid pXB296

[0103] Putative ORFs were located on the 34,010 bp sequence of pXB296 using the programs TESTCODE (Fickett, 1982) and CODONPREFERENCE (Gribskov et al., 1984), the latter in combination with a codon frequency table based on previously sequenced genes of Rhizobium sp. NGR234 (as well as the closely related R. fredii). All 28 ORFs and their deduced amino acid sequences exhibited significant homologies to known genes and/or proteins. The positions of the ORFs along pXB296, as well as the best homologues, are displayed in Table 2 and FIG. 2. Ribosomal binding site-like sequences (Shine and Dalgarno, 1974) precede each putative ORF except for ORF9 (position 11,214-12,455). If one disregards the homology to known glutamate dehydrogenases in the first 32 amino acids deduced from this ORF, a downstream alternative start codon (position 11,220) preceded by a Shine-Dalgarno sequence can be identified. Most of the ORFs are organised in five clusters (ORFs with only short intergenic spaces or overlaps between them). Cluster I, containing ORF1 to ORF5, encodes proteins homologous to trans-membrane and membrane-associated oligopeptide permease proteins and to a Bacillus anthracis encapsulation protein. Cluster II includes ORF6 and ORF7, which are homologous to aminotransferase and (semi)aldehyde dehydrogenase genes. Homologies to transposase genes [ORF8; cluster III (ORF10 and ORF11)] and to various nif and fix genes [cluster IV (ORF12 to ORF20); ORF23, part of cluster V] are also reported.
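The general idea of scoring reading frames against a codon frequency table built from previously sequenced genes can be illustrated as follows. This is a deliberately simplified sketch and not the CODONPREFERENCE algorithm: the log-ratio against a uniform codon distribution, the pseudocount, and all function names are assumptions made for illustration.

```python
from collections import Counter
from math import log

# Illustrative sketch only; not the GCG CODONPREFERENCE statistic.
BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codons(seq):
    """Split a sequence into consecutive complete codons."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(genes, pseudocount=1.0):
    """Build a codon frequency table from in-frame reference genes."""
    counts = Counter()
    for gene in genes:
        counts.update(codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def frame_scores(seq, freqs):
    """Mean log-ratio (table frequency vs. uniform 1/64) for frames 0, 1, 2."""
    scores = []
    for frame in range(3):
        usable = [c for c in codons(seq[frame:]) if c in freqs]
        if usable:
            scores.append(sum(log(freqs[c] * 64) for c in usable) / len(usable))
        else:
            scores.append(0.0)
    return scores
```

A frame whose codons resemble those of the reference genes scores above zero, while the two shifted frames of the same stretch generally score lower.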

[0104] Presumed promoter and stem-loop sequences that might represent ρ-independent terminator-like structures (Platt, 1986) are shown in FIG. 2. Significant σ⁵⁴-dependent promoter consensus sequences (5′-TGGCACG-N₄-TTGC-3′; Morett and Buck, 1989), as well as nifA upstream activator sequences (5′-TGT-N₁₀-ACA-3′; Morett and Buck, 1988), are found upstream of the nifB homologue ORF15, the fixA homologue ORF20, ORF21, ORF22, and ORF23. ORF23 is part of cluster V in pXB296, which includes the dctA gene of Rhizobium sp. NGR234 (van Slooten et al., 1992). Surprisingly, the published dctA sequence shows important discrepancies. Therefore, a fragment encompassing this locus was amplified by PCR using NGR234 genomic DNA as template. By sequencing this fragment, the cosmid sequence of the present invention was confirmed.

Example 5

[0105] Analysis of the Complete Plasmid pNGR234a

[0106] Using the thermostable sequenase/dye terminator cycle sequencing method herein described, 20 overlapping cosmids (including pXB296) of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234 were sequenced, together with two PCR products and a subcloned DNA fragment derived from cosmid pXB564 that cover two remaining gaps (position 276,448-277,944 and position 480,607-483,991). The map of the sequenced cosmids is shown in FIG. 4. The entire assembled 536 kb sequence of pNGR234a is given in FIG. 3 (deposited in EMBL/GenBank under accession no. U00090).

[0107] The analysis of the complete nucleotide sequence revealed a few regions of 98-100% identity to sequences already published in public databases; these are listed in Table 8. The published sequences had been derived from Rhizobium sp. NGR234, from derivatives of it, or from closely related strains. Therefore, the ORFs and their deduced proteins, 98-100% homologous to nifH, nodA, nodB, nodC, nodD1, nodS, nodU, nolX, nolW, nolB, nolU and “ORF1”, represent already known genes/proteins (Table 8 and References). Some other ORFs and their deduced proteins, nearly identical to public database entries, were either only partially known before the disclosure of the present invention or exhibited significant differences, for instance, dctA, host-inducible gene A, nifD, nifK, nodD2, nolT, nolX, nolV, “ORF140”, “ORF91”, “RFRS9 25 kDa-protein gene” (Table 8 and References).

[0108] As a first step, approximately 100 kb of pNGR234a between positions 417,796 and 517,279 was analyzed using the programs TESTCODE (Fickett, 1982) and CODONPREFERENCE (Gribskov et al., 1984). In this initial ˜100 kb of sequence, 76 ORFs were found and ascribed putative functions (=ORFs y4tQ to y4yO (excluding ORFs y4uD, y4uG, y4wG, y4wO, y4wP, y4xF, y4xQ, y4xG and y4yB and excluding ORF-fragments fu1, fu2, fu3, fu4, fv1 and fw1); see Table 3).

TABLE 8 All ORFs that show 98-100% identity in the nucleotide sequence to ORFs located in pNGR234a and that have already been published in databases

| ORF | Organism | EMBL/GenBank accession no. | Claimed in the patent application (+) / not claimed (−) | Remarks |
|---|---|---|---|---|
| dctA | Rhizobium sp. NGR234 | S38912 | + | sequencing mistakes in the database entry: the real dctA in pNGR234a is 144 bases longer (see Table 4) |
| host inducible geneA | Rhizobium fredii USDA 201# | M19019 RFIND | + | significant difference in pNGR234a (frameshift; see Table 4) |
| nifH | Rhizobium sp. ANU 240* | M26961 RHMNIFKDH3 | − | |
| nifD (partially) | Rhizobium sp. ANU 240* | M26961 RHMNIFKDH2 | + | only part of nifD is in the public database |
| nifK (partially) | Rhizobium sp. ANU 240* | M26961 RHMNIFKDH1 | + | only part of nifK is in the public database |
| nodABC | Rhizobium fredii USDA 257# | M73362 RSNOD2 | − | |
| nodD1 | Rhizobium sp. mpik 3030* | Y00059 RSNODD1 | − | |
| nodD2 | Rhizobium japonicum USDA 191# | M18972 RHMNODD2M | + | significantly different function of NodD2 in NGR234 than in USDA 191 (despite 98% identity °) |
| nodS | Rhizobium sp. NGR234 | J03686 NGRNODSU | − | |
| nodU (partially) | Rhizobium sp. NGR234 | J03686 NGRNODSU | − | |
| nodU (full) | Rhizobium sp.* | X89965 RSNODUGEN | | |
| nolXWBTUV | Rhizobium fredii USDA 257# | L12251 RHMNOLBTU | −/+ | − nolXWB, nolU; + NolT: 97% identical (amino acid sequence level); + NolX, NolV + ORF4 of pNGR234a show significant differences to USDA 257 (see Table 4) |
| ORF1; ORF2 (partially) | Rhizobium sp. NGR234 | X74314 RSORF | − | |
| ORF140 nodulation gene; ORF91 (partially) | Rhizobium sp. NGR234 | X74068 RSPLAS | + | database entry includes sequencing mistakes causing frameshifts |
| RFRS9 25kDa protein gene* | Rhizobium fredii USDA 257# | U18764 RFU18764 | + | repetitive element in pNGR234a showing insertions, deletions of nucleotides in comparison to the database entry |

[0109] It should be noted that since the sequence of cosmid pXB296 forms part of this 100 kb region, all of the ORFs identified in Table 2 (except “ORF1”) are reproduced (albeit with minor, but definitive, revisions) in Table 3. Most of the 76 ORFs and their deduced proteins showed homologies to public database entries that could help identify their putative functions. Only ORFs y4vK and y4xA (duplicated nifH) as well as y4yD, y4yE and y4yG (nolW, nolB and nolU) were identical to database entries (98-100% homology). In the case of 7 ORFs and their deduced proteins, no homologous sequences in public databases have been found.

[0110] As a second step, the remaining 436 kb of pNGR234a were analyzed using the methods noted above. The results of this analysis are discussed in Example 6.

Example 6

[0111] Genetic Organization of the Complete Plasmid pNGR234a

[0112] In order to confirm and to improve the identification of probable coding regions in pNGR234a, the program GeneMark was used with matrices developed for organisms related to Rhizobium sp. NGR234 (R. leguminosarum and R. meliloti; Borodovsky et al., 1994). This program is currently the most frequently applied method for distinguishing coding from non-coding regions in newly sequenced prokaryotic DNA. Further analysis of the putative ORF products was carried out using methods to detect signal sequences, transmembrane segments and various other domains (PROSITE database search (Bairoch et al., 1995); PSORT program (Nakai et al., 1991)).

[0113] In total, 416 ORFs were predicted to encode putative proteins (Freiberg et al., 1997). Additionally, 67 fragments were detected that seemed to be remnants of functional ORFs. Some of these were disrupted by insertion of mobile elements. All identified functional ORFs and fragments of former functional ORFs are listed in Table 3.

[0114] Within the initial ˜100 kb region (position 417,796 to 517,279) first analyzed in this study, 9 ORFs (y4uD, y4uG, y4wG, y4wO, y4wP, y4xF, y4xQ, y4xG and y4yB) and 6 ORF-fragments (fu1, fu2, fu3, fu4, fv1 and fw1) were predicted in addition to the 76 ORFs (y4tQ to y4yO) listed within Table 3.

[0115] According to Table 8, 12 ORFs of the 416 predicted coding regions were identical to public database entries (98% to 100% homology at the amino acid level), namely: y4hI (nodA), y4hH (nodB), y4hG (nodC), y4aL (nodD1), y4nC (nodS), y4nB (nodU), y4sM (ORF1), y4vK (nifH1), y4xA (nifH2), y4yD (nolW), y4yE (nolB), y4yG (nolU). In addition, the database entry of the homologue to y4yC (nolX) has been corrected and is now 98% identical to y4yC. Furthermore, the sequence of the ORF y4hB (noeE) has been available to the public since October 1996. Apart from the 14 ORFs mentioned above, the remaining 402 ORFs are new. Of these, 139 show no homology to any known ORF/protein. The others exhibit less than 98% amino acid identity to public database entries over their whole length.

[0116] INDUSTRIAL APPLICABILITY

[0117] The present invention provides a detailed analysis of the symbiotic plasmid pNGR234a of Rhizobium sp. NGR234. The plasmid pNGR234a (including any ORFs encoded therein, or any part of the nucleotide sequence of the plasmid, or any proteins expressible from any of said ORFs or any part of said nucleotide sequence) has industrial applicability which can include its use in, inter alia, the following areas:

[0118] (a) the analysis of the structure, organisation or dynamics of other genomes;

[0119] (b) the screening, subcloning, or amplification by PCR of nucleotide sequences;

[0120] (c) gene trapping;

[0121] (d) the identification and classification of organisms and their genetic information;

[0122] (e) the identification and characterisation of nucleotide sequences, amino acid sequences or proteins;

[0123] (f) the transportation of compounds to and from an organism which is host to at least one of said nucleotide sequences, ORFs or proteins;

[0124] (g) the degradation and/or metabolism of organic, inorganic, natural or xenobiotic substances in a host organism;

[0125] (h) the modification of the host-range, nitrogen fixation abilities, fitness or competitiveness of organisms;

[0126] (i) obtaining a synthetic minimal set of ORFs required for functional Rhizobium-legume symbiosis;

[0127] (j) the modification of the host-range of rhizobia;

[0128] (k) the augmentation of the fitness or competitiveness of Rhizobium sp. NGR234 in the soil and its nodulation efficiency on host plants;

[0129] (l) the introduction of desired phenotype(s) into host plants using said plasmid as a stable shuttle system for foreign DNA encoding said desired phenotype(s); or

[0130] (m) the direct transfer of said plasmid into rhizobia or other microorganisms without using other vectors for mobilization.

[0131] REFERENCES

[0132] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

[0133] Appelbaum, E. R., D. V. Thompson, K. Idler and N. Chartrain. 1988. Rhizobium japonicum USDA 191 has two nodD genes that differ in primary structure and function. J. Bacteriol. 170: 12-20.

[0134] Badenoch-Jones, J., T. A. Holton, C. M. Morrison, K. F. Scott and J. Shine. 1989. Structural and functional analysis of nitrogenase genes from the broad host-range Rhizobium strain ANU240. Gene 77: 141-153.

[0135] Bender, G. L., M. Nayudu, K. K. L. Strange and B. G. Rolfe. 1988. The nodD1 gene from Rhizobium strain NGR234 is a key determinant in the extension of host-range to the non-legume Parasponia. Mol. Plant-Microbe Interact. 1: 259.

[0136] Bodmer, W. F. 1994. The Human Genome Project. Rev. Invest. Clin. (Suppl.) 3-5.

[0137] Broughton, W. J., M. J. Dilworth, and I. K. Passmore. 1972. Base ratio determination using unpurified DNA. Anal. Biochem. 46: 164-172.

[0138] Broughton, W. J., N. Heycke, H. Meyer z. A., and C. E. Pankhurst. 1984. Plasmid-linked nif and “nod” genes in fast growing rhizobia that nodulate Glycine max, Psophocarpus tetragonolobus, and Vigna unguiculata. Proc. Natl. Acad. Sci. USA. 81: 3093-3097.

[0139] Broughton, W. J., C.-H. Wong, A. Lewin, U. Samrey, H. Myint, H. Meyer z. A., D. N. Dowling, and R. Simon. 1986. Identification of Rhizobium plasmid sequences involved in recognition of Psophocarpus, Vigna, and other legumes. J. Cell Biol. 102: 1173-1182.

[0140] Buikema, W. J., W. W. Szeto, P. V. Lemley, W. H. Orme-Johnson, and F. M. Ausubel. 1985. Nitrogen fixation specific regulatory genes of Klebsiella pneumoniae and Rhizobium meliloti share homology with the general nitrogen regulatory gene ntrC of K. pneumoniae. Nucleic Acids Res. 13: 4539-4555.

[0141] Cami, B. and P. Kourilsky. 1978. Screening of cloned recombinant DNA in bacteria by in situ colony hybridization. Nucleic Acids Res. 5: 2381-2390.

[0142] Craxton, M. 1993. Cosmid sequencing. Methods Mol. Biol. 23: 149-167.

[0143] Dear, S. and R. Staden. 1991. A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res. 19: 3907-3911.

[0144] Davis, E. O. and A. W. B. Johnston. 1990. Regulatory functions of the 3 nodD genes of Rhizobium leguminosarum bv. phaseoli. Mol. Microbiol. 4: 933-941.

[0145] Dower, W. J., J. F. Miller, and C. W. Ragsdale. 1988. High efficiency transformation of E. coli by high voltage electroporation. Nucleic Acids Res. 16: 6127-6145.

[0146] Fellay, R., P. Rochepeau, B. Relić, and W. J. Broughton. 1995. Signals to and emanating from Rhizobium largely control symbiotic specificity. In Pathogenesis and host specificity in plant diseases. Histopathological, biochemical, genetic, and molecular bases (ed. U. S. Singh, R. P. Singh, and K. Kohmoto), Vol. I, pp. 199-220. Pergamon/Elsevier Science Ltd., Oxford, U. K.

[0147] Fickett, J. W. 1982. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10: 5303-5318.

[0148] Fischer, H.-M. 1994. Genetic regulation of nitrogen fixation in Rhizobia. Microbiol. Rev. 58: 352-386.

[0149] Fisher, R. F. and S. R. Long. 1993. Interactions of NodD at the nod box: NodD binds to two distinct sites on the same face of the helix and induces a bend in the DNA. J. Mol. Biol. 233: 336-348.

[0150] Fleischmann, R. D., M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496-512.

[0151] Fraser, C. M., J. D. Gocayne, O. White, M. D. Adams, R. A. Clayton, R. D. Fleischmann, C. J. Bult, A. R. Kerlavage, G. Sutton, J. M. Kelley, et al. 1995. The minimal gene complement of Mycoplasma genitalium. Science 270: 397-403.

[0152] Freiberg, C., X. Perret, W. J. Broughton and A. Rosenthal. 1996. Sequencing the 500-kb GC-rich symbiotic replicon of Rhizobium sp. NGR234 using dye terminators and a thermostable sequenase: A beginning. Genome Research, in press.

[0153] Gribskov, M., J. Devereux, and R. R. Burgess. 1984. The codon preference plot: Graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. 12: 539-549.

[0154] Hanahan, D. 1983. Studies on transformation of Escherichia coli with plasmids. J. Mol. Biol. 166: 557-580.

[0155] Hartl, D. L. and M. J. Palazzolo. 1993. Drosophila as a model organism in genome analysis. In Genome research in molecular medicine and virology (ed. K. W. Adolf), pp. 115-129. Academic Press, Orlando, Fla., U.S.A.

[0156] Hiles, I. D., M. P. Gallagher, D. J. Jamieson, and C. F. Higgins. 1987. Molecular characterization of the oligopeptide permease of Salmonella typhimurium. J. Mol. Biol. 195: 125-142.

[0157] Iismaa, S. E., P. M. Ealing, K. F. Scott, and J. M. Watson. 1989. Molecular linkage of the nif/fix and nod gene regions in Rhizobium leguminosarum biovar trifolii. Mol. Microbiol. 3: 1753-1764.

[0158] Levy, J. 1994. Sequencing the yeast genome: An international achievement. Yeast 10: 1689-1706.

[0159] Lewin, A., E. Cervantes, C.-H. Wong and W. J. Broughton. 1990. nodSU, two new nod genes of the broad host range Rhizobium strain NGR234 encode host-specific nodulation of the tropical tree Leucaena leucocephala. Mol. Plant Microbe Interact. 3: 317-326.

[0160] Long, S. R. 1989. Rhizobium-legume nodulation: life together in the underground. Cell 56: 203-214.

[0161] Long, S., J. W. Reed, J. Himawan and G. C. Walker. 1988. Genetic analysis of a cluster of genes required for synthesis of the calcofluor-binding exopolysaccharide of Rhizobium meliloti. J. Bacteriol. 170: 4239-4248.

[0162] Makino, S.-I., I. Uchida, N. Terakado, C. Sasakawa, and M. Yoshikawa. 1989. Molecular characterization and protein analysis of the cap region, which is essential for encapsulation in Bacillus anthracis. J. Bacteriol. 171: 722-730.

[0163] Martinez, E., D. Romero, and R. Palacios. 1990. The Rhizobium genome. Crit. Rev. Plant Sci. 9: 59-93.

[0164] Morett, E. and M. Buck. 1988. NifA-dependent in vivo protection demonstrates that the upstream activator sequence of nif promoters is a protein binding site. Proc. Natl. Acad. Sci. USA. 85: 9401-9405.

[0165] Morett, E. and M. Buck. 1989. In vivo studies on the interaction of RNA polymerase-σ54 with the Klebsiella pneumoniae and Rhizobium meliloti nifH promoters: The role of nifA in the formation of an open promoter complex. J. Mol. Biol. 210: 65-77.

[0166] Padmanabhan, S., R.-D. Hirtz, and W. J. Broughton. 1990. Rhizobia in tropical legumes: Cultural characteristics of Bradyrhizobium and Rhizobium sp. Soil Biol. Biochem. 22: 23-28.

[0167] Pearson, W. R. and D. J. Lipman. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85: 2444-2448.

[0168] Perego, M., C. F. Higgins, S. R. Pearce, M. P. Gallagher, and J. A. Hoch. 1991. The oligopeptide transport system of Bacillus subtilis plays a role in the initiation of sporulation. Mol. Microbiol. 5: 173-185.

[0169] Perret, X., W. J. Broughton, and S. Brenner. 1991. Canonical-ordered cosmid library of the symbiotic plasmid of Rhizobium species NGR234. Proc. Natl. Acad. Sci. USA. 88: 1923-1927.

[0170] Perret, X., R. Fellay, A. J. Bjourson, J. E. Cooper, S. Brenner, and W. J. Broughton. 1994. Subtraction hybridization and shotgun sequencing: A new approach to identify symbiotic loci. Nucleic Acids Res. 22: 1335-1341.

[0171] Platt, T. 1986. Transcription termination and regulation of gene expression. Annu. Rev. Biochem. 55: 339-372.

[0172] Radloff, R., W. Bauer, and J. Vinograd. 1967. A dye-buoyant-density method for the detection and isolation of closed circular duplex DNA: The closed circular DNA in HELA cells. Proc. Natl. Acad. Sci. USA. 57: 1514-1521.

[0173] Relić, B., X. Perret, M. T. Estrada-Garcia, J. Kopcinska, W. Golinowski, H. B. Krishnan, S. G. Pueppke and W. J. Broughton. 1994. Nod factors of Rhizobium are a key to the legume door. Mol. Microbiol. 13: 171-178.

[0174] Rosenthal, A. and D. S. Charnock-Jones. 1993. Linear amplification sequencing with dye terminators. Methods Mol. Biol. 23: 281-296.

[0175] Sambrook, J., E. F. Fritsch, and T. Maniatis. 1989. Molecular cloning: A laboratory manual, 2nd ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., U.S.A.

[0176] Shine, J. and L. Dalgarno. 1974. The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: Complementary to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. 71: 1342-1346.

[0177] Smith, T. F. and M. S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.

[0178] Stanfield, S., L. Ielpi, D. O'Brochta, D. R. Helinski and G. S. Ditta. 1988. The ndvA gene product of Rhizobium meliloti is required for Beta (1-2) glucan production and has homology to the ATP binding export protein HlyB. J. Bacteriol. 170: 3523-3530.

[0179] Sulston, J., Z. Du, K. Thomas, R. Wilson, L. Hillier, R. Staden, N. Halloran, P. Green, J. Thierry-Mieg, L. Qiu, et al. 1992. The C. elegans genome sequencing project: A beginning. Nature 356: 37-41.

[0180] Tabor, S. and C. C. Richardson. 1995. A single residue in DNA polymerases of the Escherichia coli DNA polymerase I family is critical for distinguishing between deoxy- and dideoxyribonucleotides. Proc. Natl. Acad. Sci. 92: 6339-6343.

[0181] van Rhijn, P. and J. Vanderleyden. 1995. The Rhizobium-plant symbiosis. Microbiol. Rev. 59: 124-142.

[0182] van Slooten, J. C., T. V. Bhuvaneswari, S. Bardin, and J. Stanley. 1992. Two C₄-dicarboxylate transport systems in Rhizobium sp. NGR234: Rhizobial dicarboxylate transport is essential for nitrogen fixation in tropical legume symbioses. Mol. Plant Microbe Interact. 5: 179-186.

[0183] Yanisch-Perron, C., J. Vieira, and J. Messing. 1985. Improved M13 phage cloning vectors and host strains: Nucleotide sequences of M13mp18 and pUC19 vectors. Gene 33: 103-119.

[0184] Bairoch, A., P. Bucher, and K. Hofmann. 1995. The PROSITE database, its status in 1995. Nucleic Acids Res. 24: 189.

[0185] Borodovsky, M. Y., K. E. Rudd, and E. V. Koonin. 1994. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res. 22: 4756.

[0186] Broughton, W. J., U. Samrey, and J. Stanley. 1987. Ecological genetics of Rhizobium meliloti: symbiotic plasmid transfer in the Medicago sativa rhizosphere. FEMS Microbiol. Lett. 40: 251.

[0187] Fellay, R., X. Perret, V. Viprey, W. J. Broughton, and S. Brenner. 1995a. Organization of host-inducible transcripts on the symbiotic plasmid of Rhizobium sp. NGR234. Mol. Microbiol. 16: 657.

[0188] Freiberg, C., R. Fellay, A. Bairoch, W. J. Broughton, A. Rosenthal, and X. Perret. 1997. Molecular basis of symbiosis between Rhizobium and legumes. Nature 387: 394-401.

[0189] Gray, J. X., M. A. Djordjevic, and B. G. Rolfe. 1990. Two genes that regulate exopolysaccharide production in Rhizobium sp. strain NGR234: DNA sequences and resultant phenotypes. J. Bacteriol. 172: 195.

[0190] Hanin, M., S. Jabbouri, D. Quesada-Vincens, C. Freiberg, X. Perret, J.-C. Prome, W. J. Broughton, and R. Fellay. 1996. Sulphation of Rhizobium sp. NGR234 Nod factors is dependent on noeE, a new host-specificity gene. Mol. Microbiol., in press.

[0191] Krishnan, H. B., C.-I. Kuo, and S. G. Pueppke. 1995. Elaboration of flavonoid-induced proteins by the nitrogen-fixing soybean symbiont Rhizobium fredii is regulated by both nodD1 and nodD2, and is dependent on the cultivar-specificity locus, nolXWBTUV. Microbiology 141: 2245.

[0192] Morrison, N. A., C. Y. Hau, M. J. Trinick, J. Shine and B. G. Rolfe. 1983. Heat curing of a sym plasmid in a fast-growing Rhizobium sp. that is able to nodulate legumes and the nonlegume Parasponia sp. J. Bacteriol. 153: 427.

[0193] Nakai, K. and M. Kanehisa. 1991. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Structure, Function, and Genetics 11: 95-110.

[0194] Piper, K. R., S. Beck von Bodman, and S. K. Farrand. 1993. Conjugation factor of Agrobacterium tumefaciens regulates Ti plasmid transfer by autoinduction. Nature 362: 448.

[0195] Sullivan, J. T., H. N. Patrick, W. L. Lowther, D. B. Scott, and C. W. Ronson. 1995. Nodulating strains of Rhizobium loti arise through chromosomal symbiotic gene transfer in the environment. Proc. Natl. Acad. Sci. 92: 8985.

[0196] van Slooten, J. C., E. Cervantes, W. J. Broughton, C.-H. Wong, and J. Stanley. 1990. Sequence and analysis of the rpoN sigma factor gene of Rhizobium sp. strain NGR234. J. Bacteriol. 172: 5563.

[0197] van Slooten, J. C., T. V. Bhuvaneswari, S. Bardin, and J. Stanley. 1992. Two C4-dicarboxylate transport systems in Rhizobium sp. NGR234: rhizobial dicarboxylate transport is essential for nitrogen fixation in tropical legume symbioses. Mol. Plant-Microbe Interact. 5: 179.

[0198] Zhang, L.-H., P. J. Murphy, A. Kerr, and M. E. Tate. 1993. Agrobacterium conjugation and gene regulation by N-acyl-L-homoserine lactones. Nature 362: 446.

SEQUENCE LISTING

The patent application contains a lengthy “Sequence Listing” section. A copy of the “Sequence Listing” is available in electronic form from the USPTO web site (http://seqdata.uspto.gov/sequence.html?DocID=20030054522). An electronic copy of the “Sequence Listing” will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).

1. The nucleotide sequence as shown in FIG. 3 or degenerate variants thereof.
 2. The nucleotide sequence of claim 1 which has been altered by mutation, deletion or insertion.
 3. ORFs derivable from the nucleotide sequence of claim 1 or claim 2; excluding the ORFs identified as y4aL, y4hB, y4hG, y4hH, y4hI, y4nB, y4nC, y4sM, y4vK, y4xA, y4yC, y4yD, y4yE, y4yG in Table 3.
 4. ORFs y4aA to y4aS, y4bA to y4bO, y4cA to y4cQ, y4dA to y4dX, y4eA to y4eO, y4fA to y4fR, y4gA to y4gN, y4hA to y4hR, y4iC to y4iR, y4jA to y4jT, y4kA to y4kV, y4lA to y4lS, y4mA to y4mQ, y4nA to y4nM, y4oA to y4oX, y4pA to y4pO, y4qB to y4qK, y4rA to y4rO, y4sA to y4sL, y4sN to y4sO, y4tA to y4tS, y4uA to y4uP, y4vA to y4vS, y4wA to y4wM, y4xA to y4xQ, y4yA to y4yS as identified in Table 3; excluding the ORFs identified as y4aL, y4hB, y4hG, y4hH, y4hI, y4nB, y4nC, y4vK, y4xA, y4yC, y4yD, y4yE, y4yG in Table 3.
 5. The ORFs of claim 3 or claim 4 which encode the functions of: (a) nitrogen fixation, (b) nodulation, (c) transportation or permeation, (d) synthesis and modification of surface poly- or oligosaccharides, lipo-oligosaccharides or secreted oligosaccharide derivatives, (e) secretion (of proteins or other biomolecules), (f) transcriptional regulation or DNA-binding, (g) peptidolysis or proteolysis, (h) transposition or integration, (i) plasmid stability, plasmid replication or conjugal plasmid transfer, (j) stress response (such as heat shock, cold shock or osmotic shock), (k) chemotaxis, (l) electron transfer, (m) synthesis of isoprenoid compounds, (n) synthesis of cell wall components, (o) rhizopine metabolism, (p) synthesis and utilization of amino acids, rhizopines, amino acid derivatives or other biomolecules, or (q) degradation of xenobiotic compounds, or which encode: (r) proteins exhibiting similarities to proteins of amino acid metabolism or related ORFs, or (s) enzymes (such as oxidoreductase, transferase, hydrolase, lyase, isomerase or ligase).
 6. The ORFs of any one of claims 3 to 5 which are under the control of their natural regulatory elements or under the control of analogues to such natural regulatory elements.
 7. Intergenic sequences derivable from the nucleotide sequence of claim 1 or claim 2.
 8. The intergenic sequences of claim 7 which are regulatory DNA sequences or repeated elements.
 9. The intergenic sequences of claim 7 which are ORF-fragments.
 10. Mobile elements (insertion elements or mosaic elements) derivable from the nucleotide sequence of claim 1 or claim 2.
 11. Proteins expressible from the nucleotide sequences or ORFs of any one of claims 1 to 6.
 12. Use of the nucleotide sequences or ORFs of any one of claims 1 to 10 or the proteins of claim 11 in the analysis of the structure, organisation or dynamics of other genomes.
 13. Use of the nucleotide sequences or ORFs of any one of claims 1 to 10 or the proteins of claim 11 in: (a) screening nucleotide sequences, (b) subcloning nucleotide sequences, (c) amplifying nucleotide sequences by PCR, or (d) gene trapping.
 14. Use according to claim 13, wherein said nucleotide sequences are coding sequences or non-coding sequences.
 15. Use according to claim 14, wherein said coding sequences are regulatory sequences, repeated elements, mosaic sequences or insertion elements.
 16. Use according to any one of claims 12 to 15, wherein said nucleotide sequences or ORFs are oligonucleotide primers or hybridization probes.
 17. Use of the nucleotide sequences, individual ORFs or groups of ORFs of any one of claims 1 to 10 or the proteins of claim 11 in: (a) the identification and classification of organisms and their genetic information, (b) the identification and characterisation of nucleotide sequences, amino acid sequences or proteins, (c) the transportation of compounds to and from an organism which is host to at least one of said nucleotide sequences, ORFs or proteins, (d) the degradation and/or metabolism of organic, inorganic, natural or xenobiotic substances in a host organism, or (e) the modification of the host-range, nitrogen fixation abilities, fitness or competitiveness of organisms.
 18. The plasmid comprising the nucleotide sequence of claim 1 or claim 2.
 19. A plasmid which harbours at least one ORF of any one of claims 1 to 10 or any degenerate variant thereof or which harbours at least one ORF or any degenerate variant thereof which encodes one or more of the proteins of claim 11.
 20. The plasmid of claim 18 or claim 19 produced recombinantly.
 21. The plasmid of any one of claims 18 to 20 or any variant thereof produced by mutation, deletion, insertion or inactivation of an ORF, ORFs or groups of ORFs.
 22. Use of the plasmid of any one of claims 18 to 21 in: (a) obtaining a synthetic minimal set of ORFs required for functional Rhizobium-legume symbiosis, (b) the modification of the host-range of rhizobia, (c) the augmentation of the fitness or competitiveness of Rhizobium sp. NGR234 in the soil and its nodulation efficiency on host plants, (d) the introduction of desired phenotype(s) into host plants using said plasmid as a stable shuttle system for foreign DNA encoding said desired phenotype(s), or (e) the direct transfer of said plasmid into rhizobia or other microorganisms without using other vectors for mobilization. 